EXPLORATORY ANALYSIS

This study on detecting outliers in R uses two CSV files, “PregnancyDuration.csv” and “CallCenter.csv”, both available for download at www.wiley.com/go/datasmart. We start with the pregnancy duration data in “PregnancyDuration.csv”.
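
As a minimal sketch of the read step, the snippet below writes a tiny stand-in file and reads it back with read.csv(). The temporary file, the five values, and the name pregnancy.demo are assumptions made for illustration; in practice you would point read.csv() at your own downloaded copy of the file.

```r
# Write a tiny stand-in file so the sketch runs anywhere; in practice you
# would pass the path of your downloaded PregnancyDuration.csv instead
csv_path <- tempfile(fileext = ".csv")
writeLines(c("GestationDays", "349", "278", "266", "265", "269"), csv_path)

# Read the file into a data frame (pregnancy.demo is a hypothetical name)
pregnancy.demo <- read.csv(csv_path, header = TRUE)
str(pregnancy.demo)
```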

# Read the data set with the read.csv() function and give it a name
# I have already read the data and named it pregnancy.df
# Let's check the structure of our data.
str(pregnancy.df)
'data.frame':   1000 obs. of  1 variable:
 $ GestationDays: int  349 278 266 265 269 263 278 257 268 260 ...

The dataset has 1000 observations and one variable, GestationDays, which R stores as an integer (whole days).

# Checking the class of our data set using the class() function
class(pregnancy.df)
[1] "data.frame"

pregnancy.df is a data frame. A data frame is a list of variables with the same number of rows and unique row names, given class “data.frame”. If no variables are included, the row names determine the number of rows. Our data frame is a two-dimensional labelled structure that happens to have only one column and many rows. How can we be sure that we are right?

# Find the number of columns and rows
class(pregnancy.df); ncol(pregnancy.df); nrow(pregnancy.df)
[1] "data.frame"
[1] 1
[1] 1000

You can clearly see that we have one column and 1000 rows. You can go further by checking the dimensions of your data frame.

# Checking the dimension of pregnancyDuration data set using dim() function
dim(pregnancy.df)
[1] 1000    1

With a single call, you get the same information as the previous ones (rows first, then columns). Having done this, you can check the first few observations of your data set.

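# Check the first 10 observations of your data set
head(pregnancy.df, n = 10)

At the top level of the console, R prints the result of an expression automatically, so no print() call is needed. Wrapping print() in parentheses, as in (print(x)), would display the output twice; parentheses are mainly useful for displaying the value of an assignment, as in (x <- 5).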

# Check the last 10 observations of your data set
tail(pregnancy.df, n = 10)

Alternatively, you can pass the number of rows as an unnamed second argument.

# Check the first 10 observations with an alternative code
head(pregnancy.df, 10)
# Check the last 10 observations with an alternative code
tail(pregnancy.df, 10)

DESCRIPTIVE STATISTICS

From the summary statistics, you can read the minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile), and maximum. The quartiles are also the inputs for the lower and upper Tukey fences computed later. You can get all this information by summarizing the data.

# Use summary() function to summarize your data
summary(pregnancy.df)
 GestationDays  
 Min.   :240.0  
 1st Qu.:260.0  
 Median :267.0  
 Mean   :266.6  
 3rd Qu.:272.0  
 Max.   :349.0  

From the output, we see the minimum gestation is 240 days, or 8 months (counting 30 days per month), while the maximum is 349 days, or about 11.6 months. The range is therefore 109 days, or about 3.6 months, which may suggest a lot of variability in gestation days. The 25th percentile is 260 days, or about 8.67 months, and the 75th percentile is 272 days, or about 9.07 months. Finally, the median is 267 days and the mean is about 267 days (266.6), both roughly 9 months. Let’s check the histogram of gestation days.
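
The day-to-month conversions can be checked with a few lines of arithmetic. This is only a sketch: the values are copied from the summary() output, and the 30-day month is an assumption made for the conversion.

```r
# Values copied from the summary() output
gestation <- c(min = 240, q1 = 260, median = 267, q3 = 272, max = 349)
days_per_month <- 30   # assumption: a month counts as 30 days

# Convert each statistic to months and compute the range in days
months <- round(gestation / days_per_month, 2)
range_days <- unname(gestation["max"] - gestation["min"])

print(months)
print(range_days)
```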

# Call for the required packages
library(MASS)
library(car)
library(ggplot2)
# Histogram of gestation days
ggplot(data = pregnancy.df, mapping = aes(x = GestationDays)) +
  geom_histogram(bins = 20, fill = "red") +
  labs(x = "Gestation Days") +
  ggtitle("Histogram of Gestation Days") +
  theme(plot.title = element_text(hjust = 0.5))

The histogram peaks at about 272 days. Most of the values are bunched up on the left side of the histogram, with a few values spread along the right tail. This suggests that the data are not normally distributed. In addition, the plot reveals an outlier at the far right of the x-axis, completely separated from the other observations. Let’s explore the data further with a frequency polygon, which draws the bin counts as a line.

# Frequency polygon of gestation days (bin counts drawn as a line)
ggplot(data = pregnancy.df, mapping = aes(x = GestationDays)) +
  geom_freqpoly(bins = 20) +
  labs(x = "Gestation Days",
       title = "Histogram of Gestation Days",
       subtitle = "With Line Counts") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))

This output corroborates what we said earlier about the shape of the distribution. The gestation days are positively skewed, with observations stretching along the right tail. As before, the data points are clustered on the left side, which points to an asymmetric distribution. Finally, the mean and the median differ, though only slightly, which is consistent with an asymmetric distribution. A histogram is a good way to assess the shape and spread of the data and to identify potential outliers. The boxplot is equally useful for evaluating spread. Let’s go back to the summary to get a good understanding of skewness.

# summary of the dataset
summary(pregnancy.df)
 GestationDays  
 Min.   :240.0  
 1st Qu.:260.0  
 Median :267.0  
 Mean   :266.6  
 3rd Qu.:272.0  
 Max.   :349.0  

In the results, the sequence (minimum, mean, median, maximum) is (240, 266.6, 267, 349). It shows the symptoms of a positively skewed distribution: the maximum, 349, lies about 82 days above the mean of 266.6, while the minimum, 240, lies only about 27 days below it. Before going further, here is the difference between negative and positive skewness, which is the key to understanding asymmetric distributions.

In a negative skew, the left tail is longer and the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed or left-tailed, because the left tail is drawn out. A left-skewed distribution usually appears as a right-leaning curve.

In a positive skew, the right tail is longer and the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed or right-tailed, because the right tail is drawn out. A right-skewed distribution usually appears as a left-leaning curve.
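
The sign of the skew can also be computed rather than read off a plot. The sketch below uses a moment-based skewness formula; sample_skewness() is a hypothetical helper written for this illustration (base R has no skewness function), and the three vectors are made-up examples rather than the gestation data.

```r
# Moment-based skewness: mean cubed deviation divided by the cubed
# standard deviation (sample_skewness is a hypothetical helper)
sample_skewness <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # standard deviation with an n denominator
  mean((x - m)^3) / s^3
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)   # one large value stretches the right tail
left_tailed  <- -right_tailed                # mirror image stretches the left tail
symmetric    <- c(1, 2, 3, 4, 5)

print(sample_skewness(right_tailed))   # positive skew
print(sample_skewness(left_tailed))    # negative skew
print(sample_skewness(symmetric))      # no skew
```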

INTERQUARTILE RANGE

In the summary results, the interquartile range is the third quartile minus the first quartile: IQR = Q3 - Q1 = 272 - 260 = 12. Alternatively, you can call the built-in IQR() function on the GestationDays column to calculate the IQR.

# Compute the IQR by hand from the summary output (Q3 - Q1)
pregnancy.df.IQR <- 272 - 260
# Or apply the built-in IQR() function to the GestationDays column
pregnancy.df.IQR <- IQR(pregnancy.df$GestationDays)
pregnancy.df.IQR
[1] 12
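
To see that IQR() is simply the distance between the two quartiles, here is a small sketch on five illustrative values (not the full data set) chosen so that the quartiles match the summary above.

```r
x <- c(240, 260, 267, 272, 349)   # illustrative values only

# First and third quartiles, then their difference
q <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

print(iqr_by_hand)
print(IQR(x))   # the built-in function gives the same number
```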

You can then compute the lower and upper Tukey fences, named after John Wilder Tukey, an American mathematician best known for the development of the FFT algorithm and the box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller-Tukey lemma all bear his name.

# LowerInner Fence
LowerInnerFence <- 260 - 1.5 * pregnancy.df.IQR
# UpperInner Fence
UpperInnerFence <- 272 + 1.5 * pregnancy.df.IQR
# Check the content of LowerInnerFence and UpperInner Fence
print(LowerInnerFence); print(UpperInnerFence)
[1] 242
[1] 290
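
The fences can also be derived from the data instead of typing the quartiles by hand. The sketch below wraps the calculation in tukey_fences(), a hypothetical helper that is not part of base R, and runs it on five illustrative values whose quartiles are 260 and 272.

```r
# tukey_fences() is a hypothetical helper, not a base R function
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(240, 260, 267, 272, 349)   # illustrative values only
fences <- tukey_fences(x)
print(fences)
```

Passing k = 3 to the same helper would give the wider fences used further down with range = 3.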

Using R’s which() function, it is easy to find the indices of the points that violate the fences. Applied once per side, which() returns the row numbers of every data point that falls outside the Tukey fences.

# Check all the points above the upperInner Fence using which() function
print(which(pregnancy.df$GestationDays > UpperInnerFence))
[1]   1 249 252 338 345 378 478 913

In the results, these are the row numbers of the observations with values higher than the upper Tukey fence of 290. You have 8 data points that are outliers on the high side.

# Check all the points below the LowerInner Fence using which() function
print(which(pregnancy.df$GestationDays < LowerInnerFence))
[1]  99 794

In the results, two points are below the lower Tukey fence: the observations in rows 99 and 794. Next, you would like to see the actual values that lie above and below the Tukey fences.

# Values above the upper tukey fence
(pregnancy.df$GestationDays[which(pregnancy.df$GestationDays > UpperInnerFence)])
[1] 349 292 295 291 297 303 293 296

Voila, the output gives us 8 values that are above the upper tukey fence. These are values above 290 days of gestation.

# Values below the lower tukey fence
pregnancy.df$GestationDays[which(pregnancy.df$GestationDays < LowerInnerFence)]
[1] 241 240

In the results, the output displays 2 values that are below the lower tukey fence. These are values below 242 days of gestation.
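
Both sides can be handled in one pass with a logical vector. The following sketch uses eight made-up values together with the fences of 242 and 290 from the text, and returns the row numbers and values side by side.

```r
x <- c(266, 349, 270, 240, 268, 265, 271, 263)   # illustrative values only
lower <- 242   # lower fence from the text
upper <- 290   # upper fence from the text

# TRUE for every value outside either fence
is_outlier <- x < lower | x > upper

# Row numbers and values of the flagged points in one table
outliers <- data.frame(row = which(is_outlier), value = x[is_outlier])
print(outliers)
```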

Since you have all the needed information, you can now draw a boxplot using R’s boxplot() function. The boxplot() function graphs the median, the first and third quartiles, whiskers that reach the most extreme points inside the Tukey fences, and any outliers beyond them. To draw it, you pass the GestationDays column to the boxplot() function.

# Simple boxplot of gestation days
(boxplot(pregnancy.df$GestationDays))
$stats
     [,1]
[1,]  242
[2,]  260
[3,]  267
[4,]  272
[5,]  290
attr(,"class")
        1 
"integer" 

$n
[1] 1000

$conf
         [,1]
[1,] 266.4004
[2,] 267.5996

$out
 [1] 349 241 292 295 291 297 303 293 240 296

$group
 [1] 1 1 1 1 1 1 1 1 1 1

$names
[1] "1"

The whiskers can be turned into “outlier” fences by changing the range argument in the boxplot() call (it defaults to 1.5 times the IQR). If you set range = 3, the whiskers are drawn at the last point inside three times the IQR instead.

# Boxplot with whiskers at the last point inside 3 times the IQR
(boxplot(pregnancy.df$GestationDays, range = 3))
$stats
     [,1]
[1,]  240
[2,]  260
[3,]  267
[4,]  272
[5,]  303
attr(,"class")
        1 
"integer" 

$n
[1] 1000

$conf
         [,1]
[1,] 266.4004
[2,] 267.5996

$out
[1] 349

$group
[1] 1

$names
[1] "1"

If you set range = 0, the whiskers extend to the data extremes, so the upper whisker reaches our greatest outlier, 349, and no point is flagged. The output below comes from a boxplot with range 0.

(boxplot(pregnancy.df$GestationDays, range = 0))
$stats
     [,1]
[1,]  240
[2,]  260
[3,]  267
[4,]  272
[5,]  349
attr(,"class")
        1 
"integer" 

$n
[1] 1000

$conf
         [,1]
[1,] 266.4004
[2,] 267.5996

$out
numeric(0)

$group
numeric(0)

$names
[1] "1"
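
The effect of the three range settings can be reproduced without drawing anything by calling boxplot.stats(), whose coef argument plays the same role as range in boxplot(). The sketch below uses eleven made-up values, not the gestation data. Note that boxplot.stats() works from hinges, which can differ slightly from the quartiles that quantile() returns.

```r
x <- c(240, 258, 260, 262, 265, 267, 269, 272, 274, 303, 349)   # illustrative values only

out_default <- boxplot.stats(x, coef = 1.5)$out   # default whisker length
out_wide    <- boxplot.stats(x, coef = 3)$out     # whiskers at 3 times the IQR
out_none    <- boxplot.stats(x, coef = 0)$out     # whiskers at the data extremes

print(out_default)
print(out_wide)
print(out_none)
```

As the multiplier grows, fewer points are flagged, and with a multiplier of 0 none are.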

You can improve the boxplot by using ggplot2. You can draw a boxplot and show the colored outliers.

# boxplot using ggplot2 with colored outliers
ggplot(data = pregnancy.df, mapping = aes(x = " ", y = GestationDays)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 1) +
  labs(title = "Boxplot of The Pregnancy Duration Data",
       subtitle = "With Red Outliers Beyond Tukey Fences at 1.5 Times The IQR") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))

In the output, the tiny red circles are outliers. You have 8 outliers above the box and 2 outliers below it. Let’s expand our plot by adding the mean as a colored point.

# boxplot with the mean added
# (fun replaces the deprecated fun.y argument in ggplot2 3.3.0 and later)
ggplot(data = pregnancy.df, mapping = aes(x = " ", y = GestationDays)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "cyan") +
  labs(title = "Boxplot of The Pregnancy Duration Data",
       subtitle = "With Mean, Colored in Cyan, and Outliers Beyond Upper and Lower Tukey Fences") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))

In the output, the mean is the cyan diamond sitting in the middle of the box.

# Use dashed lines to show the mean of the boxplot
ggplot(data = pregnancy.df, mapping = aes(x = " ", y = GestationDays)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "text", label = "----", size = 10, color = "blue") +
  labs(title = "Boxplot of Gestation Days",
       subtitle = "With Mean, Blue Dashed Line, and Outliers Beyond Upper and Lower Tukey Fences") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))

Again, the mean is the dashed line just below the median, which confirms the summary statistics. Additionally, you can pull the numbers behind the boxplot instead of reading them off the plot. The stats element of the list returned by boxplot() holds the whisker ends (lower and upper bounds) and the quartiles. Let’s do this in code by passing the gestation days to the boxplot() function and extracting the stats element from the result.

# Using stats list to find the boxplot distribution.
(boxplot(pregnancy.df$GestationDays, range = 3)$stats)
     [,1]
[1,]  240
[2,]  260
[3,]  267
[4,]  272
[5,]  303
attr(,"class")
        1 
"integer" 

You can also use the out element to get the outliers.

# Using the out element to get the outliers
(boxplot(pregnancy.df$GestationDays, range = 3)$out)
[1] 349

The R console gives us the only value beyond three times the IQR, which is the maximum, 349 days. Let’s move to the harder part of this tutorial: finding outliers in the call center employee performance data.

# call the required libraries, even though we won't use them all.
library(tidyverse)
library(MASS)
library(car)
library(ggplot2)

Next, we need to load our data set. In this case, the data is already loaded under the name CallCenter, as shown in the R global environment, so there is nothing more to read. Let’s check the first and last few records in the data.

# First 10 observations of the CallCenter data
(head(CallCenter, n = 10))
# Last 10 observations of the CallCenter data
(tail(CallCenter, n = 10))

You can further explore the structure of the data by using the R str() function.

# Check the structure of the data using str() function
print(str(CallCenter))
'data.frame':   400 obs. of  11 variables:
 $ ID               : int  144624 142619 142285 142158 141008 145082 139410 135014 139356 137368 ...
 $ AvgTix           : num  152 155 164 159 156 ...
 $ Rating           : num  3.32 3.16 4 2.77 3.52 3.9 3.45 3.67 3.4 3.3 ...
 $ Tardies          : int  1 1 3 0 4 3 3 0 0 1 ...
 $ Graveyards       : int  0 3 3 3 1 2 3 3 1 3 ...
 $ Weekends         : int  2 1 1 1 0 1 1 1 1 1 ...
 $ SickDays         : int  3 1 0 2 3 3 3 1 4 0 ...
 $ PercSickOnFri    : num  0 0 0 0.5 0.67 1 0 0 0.25 0 ...
 $ EmployeeDevHrs   : int  0 12 23 13 16 5 13 18 14 33 ...
 $ ShiftSwapsReq    : int  2 1 2 1 1 1 2 1 0 2 ...
 $ ShiftSwapsOffered: int  1 2 0 0 0 0 1 2 3 4 ...
NULL

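The CallCenter data is a data frame containing 400 observations and 11 variables. All the variables are numeric: some are stored as integers (counts such as Tardies or SickDays) and others as doubles (such as AvgTix or Rating). The NULL at the end of the output is not part of the data; it is the value returned by str(), which print() then displays. Calling str(CallCenter) on its own avoids it. You can also run summary statistics on the data to see the scope of its distribution.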
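
A quick way to confirm the column types programmatically is to apply class() and is.numeric() to every column. The sketch below does this on toy, a hypothetical three-row stand-in built from values shown in the str() output, since the real CallCenter object is not recreated here.

```r
# A tiny stand-in for CallCenter (toy is a hypothetical name)
toy <- data.frame(
  ID      = c(144624L, 142619L, 142285L),
  Rating  = c(3.32, 3.16, 4.00),
  Tardies = c(1L, 1L, 3L)
)

col_classes <- sapply(toy, class)              # storage class of each column
all_numeric <- all(sapply(toy, is.numeric))    # TRUE when every column is numeric

print(col_classes)
print(all_numeric)
```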

# Summary statistics of the CallCenter data using the R summary() function
# You code this by passing the CallCenter data to the summary() function
(summary(CallCenter))
       ID             AvgTix          Rating         Tardies        Graveyards       Weekends     
 Min.   :130564   Min.   :143.1   Min.   :2.070   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:134402   1st Qu.:153.1   1st Qu.:3.210   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.0000  
 Median :137907   Median :156.1   Median :3.505   Median :1.000   Median :2.000   Median :1.0000  
 Mean   :137946   Mean   :156.1   Mean   :3.495   Mean   :1.465   Mean   :1.985   Mean   :0.9525  
 3rd Qu.:141771   3rd Qu.:159.1   3rd Qu.:3.810   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :145176   Max.   :168.7   Max.   :4.810   Max.   :4.000   Max.   :4.000   Max.   :2.0000  
    SickDays     PercSickOnFri    EmployeeDevHrs  ShiftSwapsReq   ShiftSwapsOffered
 Min.   :0.000   Min.   :0.0000   Min.   : 0.00   Min.   :0.000   Min.   :0.00     
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.: 6.00   1st Qu.:1.000   1st Qu.:0.00     
 Median :2.000   Median :0.2500   Median :12.00   Median :1.000   Median :1.00     
 Mean   :1.875   Mean   :0.3522   Mean   :11.97   Mean   :1.448   Mean   :1.76     
 3rd Qu.:3.000   3rd Qu.:0.6700   3rd Qu.:17.00   3rd Qu.:2.000   3rd Qu.:3.00     
 Max.   :7.000   Max.   :1.0000   Max.   :34.00   Max.   :5.000   Max.   :9.00     

Centering Your Data

R makes it easy to center and scale the columns of a numeric matrix with the scale() function. To center a variable, you subtract the mean of all data points from each individual data point. The scale() function takes the following arguments: (1) x: a numeric object; (2) center: if TRUE, each column’s mean is subtracted from the values in that column, and if FALSE, centering is not performed; (3) scale: if TRUE, the centered column values are divided by the column’s standard deviation, and if FALSE, scaling is not performed. In the CallCenter data, you scale variables 2 to 11 because the first variable holds the employees’ IDs.
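
To make the two steps concrete, the sketch below centers and scales a small made-up matrix by hand with sweep() and compares the result with scale(). The matrix m and its column names are illustrative assumptions.

```r
m <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2,
            dimnames = list(NULL, c("a", "b")))   # illustrative data only

centered <- sweep(m, 2, colMeans(m), "-")             # subtract each column mean
by_hand  <- sweep(centered, 2, apply(m, 2, sd), "/")  # divide by each column sd

auto <- scale(m, center = TRUE, scale = TRUE)

print(colMeans(by_hand))     # column means after scaling
print(apply(by_hand, 2, sd)) # column standard deviations after scaling
print(all.equal(by_hand, auto, check.attributes = FALSE))
```

After scaling, every column has mean 0 and standard deviation 1, which puts variables measured in different units on a comparable footing.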

# Using scale() function to center and scale variables 2 to 11
# Here is the structure of the scale() function: scale(data, center = TRUE or FALSE, scale = TRUE or FALSE)
CallCenter.scale <- scale(CallCenter[2:11], center = TRUE, scale = TRUE)
# Check the first 5 observations of the CallCenter.scale data
(head(CallCenter.scale, n = 5))
         AvgTix      Rating   Tardies Graveyards   Weekends    SickDays PercSickOnFri EmployeeDevHrs
[1,] -0.9703649 -0.37952536 -0.478052  -2.498183  1.9092970  0.67215066    -0.8962958   -1.602226830
[2,] -0.2005485 -0.72622281 -0.478052   1.277408  0.0865791 -0.52278385    -0.8962958    0.004015606
[3,]  1.8372007  1.09393879  1.578086   1.277408  0.0865791 -1.12025110    -0.8962958    1.476404506
[4,]  0.6598345 -1.57129784 -1.506121   1.277408  0.0865791  0.07468341     0.3761287    0.137869142
[5,] -0.1326235  0.05384645  2.606155  -1.239653 -1.7361388  0.67215066     0.8087530    0.539429752
     ShiftSwapsReq ShiftSwapsOffered
[1,]     0.5525710        -0.4192811
[2,]    -0.4475575         0.1324046
[3,]     0.5525710        -0.9709668
[4,]    -0.4475575        -0.9709668
[5,]    -0.4475575        -0.9709668
# Check the last 5 observations of the CallCenter.scale data
(tail(CallCenter.scale, n = 5))
           AvgTix      Rating    Tardies  Graveyards  Weekends    SickDays PercSickOnFri EmployeeDevHrs
[396,]  1.4749342 -0.27118241  0.5500169  1.27740845 0.0865791 -0.52278385     1.6485532     -0.7991056
[397,]  0.7956845  0.01050927 -0.4780520  0.01887796 0.0865791  0.07468341    -0.8962958      0.4055762
[398,] -0.1552652  1.72232792  1.5780857  1.27740845 0.0865791  0.07468341     0.3761287      0.8071368
[399,] -1.2194231  1.07227020 -0.4780520  0.01887796 0.0865791  1.26961792    -0.2600836      2.4133793
[400,]  1.0221010  0.81224712 -0.4780520 -1.23965254 0.0865791  0.07468341     1.6485532      0.5394298
       ShiftSwapsReq ShiftSwapsOffered
[396,]    -0.4475575         0.6840903
[397,]    -0.4475575        -0.4192811
[398,]    -0.4475575         0.6840903
[399,]     0.5525710         0.6840903
[400,]    -0.4475575        -0.9709668

You can now run descriptive statistics on the scaled data.

# First let's look at the structure of the scaled data using the str() function
print(str(CallCenter.scale))
 num [1:400, 1:10] -0.97 -0.201 1.837 0.66 -0.133 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10] "AvgTix" "Rating" "Tardies" "Graveyards" ...
 - attr(*, "scaled:center")= Named num [1:10] 156.086 3.495 1.465 1.985 0.953 ...
  ..- attr(*, "names")= chr [1:10] "AvgTix" "Rating" "Tardies" "Graveyards" ...
 - attr(*, "scaled:scale")= Named num [1:10] 4.417 0.461 0.973 0.795 0.549 ...
  ..- attr(*, "names")= chr [1:10] "AvgTix" "Rating" "Tardies" "Graveyards" ...
NULL
# Summary statistics of the scaled data
(summary(CallCenter.scale))
     AvgTix              Rating            Tardies          Graveyards          Weekends       
 Min.   :-2.940189   Min.   :-3.08810   Min.   :-1.5061   Min.   :-2.49818   Min.   :-1.73614  
 1st Qu.:-0.681684   1st Qu.:-0.61788   1st Qu.:-0.4781   1st Qu.:-1.23965   1st Qu.: 0.08658  
 Median :-0.008094   Median : 0.02134   Median :-0.4781   Median : 0.01888   Median : 0.08658  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
 3rd Qu.: 0.682476   3rd Qu.: 0.68224   3rd Qu.: 0.5500   3rd Qu.: 0.01888   3rd Qu.: 0.08658  
 Max.   : 2.856075   Max.   : 2.84909   Max.   : 2.6062   Max.   : 2.53594   Max.   : 1.90930  
    SickDays        PercSickOnFri     EmployeeDevHrs      ShiftSwapsReq     ShiftSwapsOffered
 Min.   :-1.12025   Min.   :-0.8963   Min.   :-1.602227   Min.   :-1.4477   Min.   :-0.9710  
 1st Qu.:-1.12025   1st Qu.:-0.8963   1st Qu.:-0.799106   1st Qu.:-0.4476   1st Qu.:-0.9710  
 Median : 0.07468   Median :-0.2601   Median : 0.004016   Median :-0.4476   Median :-0.4193  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.67215   3rd Qu.: 0.8088   3rd Qu.: 0.673283   3rd Qu.: 0.5526   3rd Qu.: 0.6841  
 Max.   : 3.06202   Max.   : 1.6486   Max.   : 2.948793   Max.   : 3.5530   Max.   : 3.9942  

In the results, you can see that the means of all variables are 0. Now that the data is ready to be analyzed, you can send it to the lofactor() function, which is part of the DMwR package. I have already installed the package.

library(DMwR)

The lofactor() function contained in the DMwR package locates local outliers using the LOF algorithm. Namely, given a data set, it produces a vector of local outlier factors, one for each case, so the vector has as many values as there are rows in the original data set. The code is an implementation of the LOF method by Breunig et al. (2000). Learn more by typing ?lofactor in the R console.
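
To give a feel for what a local outlier factor measures, here is a simplified base R sketch of the idea. It is not the DMwR implementation: simple_lof() is a hypothetical helper written for illustration, it assumes there are no duplicate rows, and it breaks distance ties by order rather than keeping every tied neighbour. The toy data is simulated.

```r
# simple_lof() is a hypothetical helper written for illustration only
simple_lof <- function(x, k = 5) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf                     # a point is never its own neighbour
  # row i holds the indices of the k nearest neighbours of point i
  nn <- matrix(unlist(lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])),
               nrow = n, byrow = TRUE)
  # k-distance: distance from each point to its k-th nearest neighbour
  kdist <- d[cbind(seq_len(n), nn[, k])]
  # local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    reach <- pmax(kdist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  }, numeric(1))
  # LOF: average density of the neighbours relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i], numeric(1))
}

set.seed(1)
cluster <- matrix(rnorm(60), ncol = 2)   # 30 simulated points around the origin
toy <- rbind(cluster, c(10, 10))         # row 31 sits far away from the cluster
toy_lof <- simple_lof(toy, k = 5)

print(which.max(toy_lof))   # row with the largest factor
```

Points inside the cluster have neighbours about as dense as themselves, so their factors stay near 1, while the isolated point is much sparser than its neighbours and gets a large factor.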

To call the lofactor() function, you supply it with the data and a k value, the number of neighbours used to compute each factor. This example uses k = 5, and the function spits out the LOFs.

(CallCenter.lof <- lofactor(CallCenter.scale, 5))
  [1] 1.3408038 1.0051834 1.1123486 1.0557821 1.2371926 1.0137109 1.0883557 0.9813427 1.1206137 1.2521956
 [11] 1.0558297 1.1269558 0.9841484 1.0667666 1.0307162 0.9844507 1.2741940 1.0055474 1.1982345 1.4698625
 [21] 1.4470641 1.0232575 1.0372341 1.3195439 1.0611093 1.0416615 1.0102924 1.1432867 0.9786627 1.0040209
 [31] 1.0447568 1.0009882 1.0171988 1.0901495 1.0996541 1.2948341 1.1858127 1.0068645 1.0685252 1.1423162
 [41] 1.1144680 1.0378742 0.9844365 0.9779774 1.0250004 1.0110349 1.0790022 1.1309026 1.0798142 1.0822591
 [51] 1.0788040 1.1019461 0.9737883 1.0052269 1.0456276 1.0107045 1.0307720 1.1643033 1.2431839 0.9666509
 [61] 1.1031478 1.2003758 1.1167494 1.1037606 1.0255370 1.0171380 0.9957432 1.1743580 1.0477126 1.0034990
 [71] 1.0462489 1.0061843 1.0380687 0.9862380 1.0383248 1.4241432 1.1441133 0.9893924 1.0315633 1.1505668
 [81] 1.0325250 0.9889635 1.0071661 1.3346601 1.1791746 0.9649603 0.9905224 1.0649977 1.0129738 1.0453200
 [91] 1.1590392 1.1675810 0.9779721 1.4029638 1.3523042 1.0552061 0.9969275 1.0830638 1.0101884 1.0883310
[101] 1.0109415 1.3044815 1.0212112 0.9980336 1.0030789 1.0302064 0.9247193 1.0062806 0.9984376 1.1162410
[111] 0.9823186 1.0060886 1.1589316 0.9892826 1.0830274 1.0135637 1.0046389 1.0365509 0.9813457 1.0145190
[121] 1.2653442 1.0683295 1.0082897 0.9607305 1.0923547 1.1270543 1.1689214 1.0843162 1.2911142 1.1057245
[131] 1.0011486 1.1370280 1.1232655 1.2033569 1.2450830 1.0566725 1.0761990 1.0091205 1.1092464 1.2047034
[141] 1.0955902 0.9844139 1.1383286 1.1259207 0.9892031 1.2447260 0.9981420 1.0821813 1.1413045 1.0686800
[151] 1.0225226 1.2905067 1.0074159 1.1910622 1.1519677 0.9876663 1.0667645 1.0663731 0.9813381 0.9904982
[161] 1.1137023 1.1264898 1.0150397 1.1977160 1.3008639 1.1213267 1.0384217 0.9937663 0.9767749 0.9593312
[171] 1.0462209 1.0327596 1.0521310 1.0656835 1.1480740 1.0777549 1.1232615 1.0529995 1.1679990 1.0196460
[181] 0.9740810 1.0535315 1.0782753 1.2429414 1.1445937 1.0215425 1.0074387 1.0121273 1.0434222 1.1255809
[191] 1.0216994 1.1588358 1.0367122 1.1312298 0.9795271 1.1298598 1.0297991 1.1183479 1.0991143 1.3991179
[201] 1.1168922 1.2455259 1.0816046 1.0284624 1.1158375 1.0948648 1.3037175 1.0254871 0.9641669 1.2651222
[211] 1.0007618 0.9746770 1.0531306 1.4217901 1.1150524 0.9820805 1.1612901 0.9964528 0.9921633 1.0277144
[221] 1.0293756 1.1196775 1.1566654 1.2297558 1.1079424 0.9904334 1.1149821 0.9939430 1.0010735 0.9851308
[231] 1.3394905 1.0302344 1.1841599 1.0149589 1.2804495 1.1880453 1.4601193 1.0050842 1.4499324 1.1135559
[241] 1.1988263 1.0022138 1.1468372 1.1646916 1.0202601 0.9939261 1.0032967 1.0260007 1.0496837 1.0049157
[251] 0.9729191 1.0695047 0.9634032 1.0613979 1.1142692 1.0043943 1.0457583 1.0133710 1.1736124 1.0394262
[261] 1.0610609 1.0983300 1.1919957 1.1745827 1.0316455 1.3433954 1.1771837 1.0269947 1.0587343 1.0858310
[271] 1.2692550 1.0255964 1.0678142 0.9860664 1.0386526 1.2048119 1.0760391 0.9994004 1.2161886 1.1492698
[281] 1.0641118 1.0133139 1.1645779 1.0864450 1.0368234 1.2167903 0.9973810 1.0175809 1.0994488 1.0995937
[291] 1.1432109 1.0289309 1.0121983 1.1069686 1.1417604 1.2159379 1.0931900 1.1261413 1.7605723 1.0864391
[301] 1.0452020 1.2102214 1.0849146 0.9871716 1.0030268 1.0615736 1.1023937 0.9654460 0.9870671 1.0789900
[311] 1.2592372 1.0480225 1.0548818 1.0127955 1.1235240 0.9894615 1.0275476 1.2180248 1.0251506 0.9937405
[321] 0.9994192 0.9948396 1.1437842 1.0398916 1.1639307 0.9952146 1.0284795 1.0525406 1.0992120 1.0248649
[331] 1.0105702 1.0191950 1.0679836 1.0507263 1.4221162 1.0126496 1.0434213 1.1336063 1.1646151 0.9946414
[341] 1.0229918 1.3907169 0.9720104 1.3128955 1.2216433 1.0762238 1.0554889 0.9963017 1.1952126 1.1804749
[351] 0.9874807 1.0866453 0.9727127 0.9560749 1.0627616 1.0856518 1.0885088 1.2406137 1.0608744 1.0379972
[361] 1.0242488 1.0376633 1.0651370 1.0234395 1.0237234 0.9611999 1.0855394 1.0771785 0.9786611 1.1258066
[371] 1.1235259 1.0410275 1.1797463 1.9704225 0.9979773 1.2055868 1.1864624 1.0014150 1.1431236 1.1302655
[381] 1.0768061 0.9992923 1.0443288 1.0351075 1.0511732 1.0210434 1.0309915 1.0340882 1.0100678 1.0963848
[391] 1.1180203 1.1697097 0.9867001 1.0532558 1.0567717 1.0066670 1.0259471 1.1069178 1.1623194 0.9841127

The data points with the highest factors (LOFs usually hover around 1) are the oddest points. For instance, you may want to highlight the data associated with those employees whose LOF (Local Outlier Factor) is greater than 1.5.

# LOF greater than 1.5 using which() function
print(which(CallCenter.lof > 1.5))
[1] 299 374

In the results, the employees in rows 299 and 374 have local outlier factors greater than 1.5. You can also list the complement, that is, the employees whose LOF is lower than 1.5. However, this step is not necessary.

# LOF lower than 1.5 using the which() function
print(which(CallCenter.lof < 1.5))
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
 [26]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50
 [51]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75
 [76]  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
[101] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
[126] 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
[151] 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
[176] 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
[201] 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225
[226] 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
[251] 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275
[276] 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 300 301
[301] 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326
[326] 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351
[351] 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 375 376 377
[376] 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
# You may want to see variables associated with these data points
# You want rows 299 and 374 and their corresponding records from the original data CallCenter
print(CallCenter[which(CallCenter.lof > 1.5), ])
# Variables associated with all records lower than 1.5 from the CallCenter data
print(CallCenter[which(CallCenter.lof < 1.5), ])
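
Instead of picking a fixed threshold, you may prefer to rank the cases from oddest to most ordinary. The sketch below does this with order() on a short vector of made-up scores; with the real data you would use CallCenter.lof in place of lof.

```r
lof <- c(1.34, 1.01, 1.76, 0.98, 1.97, 1.11)   # made-up scores for illustration

# Row numbers sorted from the highest factor to the lowest
ranked <- order(lof, decreasing = TRUE)

# Keep the two oddest rows together with their factors
top2 <- data.frame(row = ranked[1:2], lof = lof[ranked[1:2]])
print(top2)
```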
