Fig. 1: The 5-nearest neighbors of node ‘a’ and each node’s 5-distance vector

I have been writing and posting R scripts to code up the solutions to the problems posed by John Foreman, Chief Data Scientist at MailChimp, in his book, Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html].

In chapter 9, the author explains and develops the Local Outlier Factor approach to identifying outliers in multi-dimensional data. He does this using Excel, as usual. The problem the author posed was to identify outliers among a group of 400 call center employees given their job performance data.

In chapter 10, he presents the R solution to this problem. It requires just five lines of code, using the lofactor() function from the DMwR package. The solution shown below is the author’s, with only the variable names changed.

I am posting the solution here because the Local Outlier Factor approach is such an interesting one and because it is effective and easy to comprehend. Enjoy!

# Load DMwR package, which accompanies the book 'Data Mining with R', by Luis Torgo
# It contains the function necessary to calculate the Local Outlier Factors
library(DMwR) # contains the lofactor function

# Read source data
callCenter <- read.csv("call-center.csv")

# Standardize the ten performance columns (mean 0, sd 1); column 1 is Employee.ID
callCenterSc <- scale(callCenter[, 2:11])

# Calculate the local outlier factor (LOF) for each employee, using k = 5 nearest neighbors
lof <- lofactor(callCenterSc, 5)

# Look at the employees for which lof > 1.5.
# An LOF near 1 means a point sits about as close to its neighbors as they do to
# theirs; values well above 1 mark points in a sparser region than their neighbors.
# The process identifies two exceptional employees: one who performs especially
# well and is generous and flexible with shifts, and another whose performance
# lags across multiple dimensions.
callCenter[which(lof > 1.5),]
##     Employee.ID Avg.Tix...Day Customer.rating Tardies
## 299      137155         165.3            4.49       1
## 374      143406         145.0            2.33       3
##     Graveyard.Shifts.Taken Weekend.Shifts.Taken Sick.Days.Taken
## 299                      3                    2               1
## 374                      1                    0               6
##     X..Sick.Days.Taken.on.Friday Employee.Dev..Hours Shift.Swaps.Requested
## 299                         0.00                  30                     1
## 374                         0.83                  30                     4
##     Shift.Swaps.Offered
## 299                   7
## 374                   0
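
For anyone curious about what a local outlier factor actually measures, here is a minimal base-R sketch of the standard LOF definition. It is not the DMwR implementation of lofactor(): the function name lof_sketch and the toy data are hypothetical, made up for illustration, and the sketch assumes there are no duplicate rows and breaks ties in distance arbitrarily instead of widening the neighborhood.

```r
# Hypothetical helper: a plain base-R sketch of LOF, not DMwR's lofactor().
# Assumes no duplicate rows; ties in distance are broken arbitrarily.
lof_sketch <- function(x, k) {
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf            # a point is never its own neighbor

  # Row i holds the indices of the k nearest neighbors of point i
  nn <- matrix(
    unlist(lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])),
    nrow = n, byrow = TRUE
  )

  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- d[cbind(seq_len(n), nn[, k])]

  # Local reachability density: inverse of the mean reachability distance,
  # where reach-dist(i, o) = max(k-distance(o), d(i, o))
  lrd <- sapply(seq_len(n), function(i) {
    neighbors <- nn[i, ]
    reach <- pmax(kdist[neighbors], d[i, neighbors])
    1 / mean(reach)
  })

  # LOF: mean density of the neighbors divided by the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

# Toy data (made up): 50 points around the origin plus one far-away point
set.seed(1)
toy <- rbind(matrix(rnorm(100), ncol = 2), c(8, 8))
scores <- lof_sketch(scale(toy), 5)

# The far-away point is row 51 and should receive the largest score
which.max(scores)
```

Points inside the cluster score close to 1 because their density matches that of their neighbors, while the isolated point scores far higher. That ratio of densities is what the lof > 1.5 cutoff above is testing.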