Local Outlier Factor (LOF) is an unsupervised machine learning algorithm for detecting outlier observations in a multivariate dataset. LOF can be applied in various settings, such as spotting unusual points in a dataset, and it can also be used to detect anomalous conditions. Detecting anomalies early is crucial to prevent unwanted conditions from escalating. Anomaly detection is used in many industries, such as e-commerce and banking, and catching anomalies or unusual situations before they grow helps a company prevent financial loss. Here, we will learn about the Local Outlier Factor algorithm, the basics behind it, and how it detects unusual observations.
An outlier is an observation that does not follow the same pattern as the rest of the data. For example, a boy who scores 100 on his math test while his friends mostly score around 30 is an outlier: his score stands apart because the typical value in the data is around 30. A ten-year-old girl with an IQ of 170 while her classmates' IQs are around 100 is also an outlier, since most of the information in that situation tells us that the "normal" number is around 100.
There are two types of outliers: point outliers and collective outliers. A point outlier is a single data point that is unusual compared to the rest of the data, while a collective outlier is a group of more than one point that stands apart from the rest of the data. An instance of a point outlier was introduced earlier, and it can be visualized with a graphical summary, the boxplot.
For example, suppose 101 students take a math test with the following results.
math_test <- c(rnorm(100, mean=30, sd=4),100)
boxplot(math_test,
main = "Math Test Score",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
The students mostly scored around 30, while one student stunningly got 100.
An example of a collective outlier can be visualized as follows:
math_test2 <- c(rnorm(100, mean=30, sd=4),90,80,93,100)
boxplot(math_test2,
main = "Math Test Score",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Four students achieved stunning scores compared to their classmates; these four students are considered a collective outlier. Collective outliers can be found in many datasets, since most datasets contain a large number of observations.
Visualization is a great tool for spotting outlier points. Besides visualization, we can also use statistical procedures to test whether a data point is an outlier. One such test is Grubbs' test. Grubbs' test examines the largest or smallest point in the data, assuming the data are normally distributed. Before using Grubbs' test, the user needs to check the distribution of the data; one common way to do this is the histogram. A typical normally distributed histogram looks like this.
data_example <- rnorm(10000, mean= 50, sd = 15)
hist(data_example)
With the math score example above, we can run Grubbs' test from the outliers package as follows:
library(outliers)
grubbs.test(math_test)
##
## Grubbs test for one outlier
##
## data: math_test
## G = 8.53907, U = 0.26355, p-value < 2.2e-16
## alternative hypothesis: highest value 100 is an outlier
The threshold for the p-value is 0.05; if the p-value is below 0.05, the point is considered an outlier. Grubbs' test tests a hypothesis about either the highest or the lowest value in the dataset. Here, Grubbs' test indicates that the highest value is an anomaly, with a p-value below 0.05. We can also see which row is the outlier by finding the max or the min of the data; here, we will use the max.
which.max(math_test)
## [1] 101
math_test[101]
## [1] 100
Row number 101 is the student with a score of 100, the student who got a far better result than the rest of his classmates.
Local Outlier Factor, often abbreviated as LOF, is an outlier detection algorithm based on density scoring. It is related to the k-Nearest Neighbour (KNN) algorithm: in KNN, the user looks for observations that share similar characteristics, and its output can be seen as many points lying close together, whereas in LOF the user looks for observations that are not alike their neighbours. To understand the basics of LOF, we first need to review a few fundamental ideas.
K-Nearest Neighbour (KNN) is a classification algorithm that uses the k nearest observations. KNN runs through a few steps before deciding which class an observation will be classified into. The k-distance score used here is the mean distance between an observation and its k nearest observations.
[Figure: an observation and its three nearest neighbours circled]
In the example plot above, the initial observation is circled together with its three nearest observations; here, k = 3.
Suppose we have four observations with the following points.
library(ggplot2)
library(hrbrthemes)
name <- as.factor(c("Apple", "Banana", "Cucumber", "Durian"))
x <- c(0, 1, 1, 0)
y <- c(0, 0, 1, 3)
df <- data.frame(name, x, y)
ggplot(df, aes(x=x, y=y, color=name)) +
geom_point(size=6) +
xlim(-4, 4) +
ylim(-4, 4) +
theme_ipsum()
One point, Durian, hints at being an outlier since it is far from the rest of the data. Since we are dealing with multivariate data, we cannot use Grubbs' test to assess the outliers of this dataset. Instead, we can use the Local Outlier Factor to detect which observation should be considered an outlier.
We will calculate the distance between these observations with get.knn from the FNN package. The non-numeric column should be dropped since it does not represent a numerical value to calculate distances on. We also use k = 2 to look at the two nearest observations of each point.
library(FNN)
library(tidyverse)
knn_df <- get.knn(data = df %>% select(-name), k = 2)
head(knn_df$nn.dist)
## [,1] [,2]
## [1,] 1.000000 1.414214
## [2,] 1.000000 1.000000
## [3,] 1.000000 1.414214
## [4,] 2.236068 3.000000
The output produced by the code above is the distance matrix for each observation. The first column is the distance between each observation and its closest neighbour, while the second column is the distance between each observation and its second closest neighbour. Since k = 2, the output matrix consists of two columns, and the distances are Euclidean distances.
To get the KNN distance score, we average the distances of each observation to its k closest neighbours.
df_score <- rowMeans(knn_df$nn.dist)
df_score
## [1] 1.207107 1.000000 1.207107 2.618034
From the distances above, we can see that the point with the largest distance score can be classified as an outlier.
which.max(df_score)
## [1] 4
The code shows that the fourth observation, Durian, is hinted to be an outlier point.
We can also visualize the distance score for those observations for a more straightforward interpretation.
df$distance <- df_score
library(plotly)
a <- ggplot(df, aes(x=x, y=y, color=name)) +
geom_point(aes(size=distance)) +
xlim(-1, 2) +
ylim(-1, 4) +
theme_ipsum()
ggplotly(a)
The size of each point is based on its distance score, and we can see above that Durian is drawn with a bolder bullet than the rest of the observations.
LOF calculates the local density of a point and compares it to the local density of its k neighbours. Suppose we have the example in the following picture.
[Figure: clusters of points with outlying points o1 and o2]
In the picture above, we can see that point o1 is an outlier since its distance to the other points is much larger than the distances among any of the nearby points. The question is, how should we label o2? With the Local Outlier Factor algorithm, we can calculate a score for o2 and see whether it is an outlier or not. The Local Outlier Factor produces a score with the following interpretation:
If a point is less dense than the high-density area in C2, it is more likely to be an outlier. The more isolated a point is compared to its high-density neighbours, the more likely the observation is categorized as an outlier.
[Figure: local outliers o1 and o2, global outlier o3, and point o4 near a sparse cluster]
In another example, o1 and o2 can be considered outliers, and the term for this kind of outlier is local. o3 is a global outlier, while o4 is not an outlier since the neighbourhood surrounding it is itself not dense, so o4 does not stand out from its neighbours the way o1, o2 and o3 do.
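To make the local-versus-global idea concrete, here is a minimal sketch with made-up data (not the points in the figures above): one dense cluster, one sparse cluster, and a point placed the same distance outside each. The lof() function from the dbscan package, which we will use properly later, should give the point next to the dense cluster a much larger score.
library(dbscan)
set.seed(1)
dense <- matrix(rnorm(200, mean = 0, sd = 0.2), ncol = 2)  # tight cluster
sparse <- matrix(rnorm(200, mean = 10, sd = 2), ncol = 2)  # spread-out cluster
near_dense <- c(1.5, 0)    # 1.5 away from the dense cluster centre
near_sparse <- c(11.5, 10) # 1.5 away from the sparse cluster centre
toy <- rbind(dense, sparse, near_dense, near_sparse)
toy_lof <- lof(toy, k = 10)
tail(toy_lof, 2) # the point beside the dense cluster gets the far larger score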
Before we go to the function itself, we will calculate the Local Outlier Factor score manually using the small fruit data frame from earlier.
Using the get.knn function, we can calculate the distance between every pair of observations. Since the data consists of 4 observations, we will use k = 3 to cover all possible neighbours.
example_distance <- get.knn(data = df %>% select(x, y), k = 3) # use only the coordinates; the distance column added for the earlier plot is not a feature
example_distance
## $nn.index
## [,1] [,2] [,3]
## [1,] 2 3 4
## [2,] 1 3 4
## [3,] 2 1 4
## [4,] 3 1 2
##
## $nn.dist
## [,1] [,2] [,3]
## [1,] 1.000000 1.414214 3.000000
## [2,] 1.000000 1.000000 3.162278
## [3,] 1.000000 1.414214 2.236068
## [4,] 2.236068 3.000000 3.162278
The output above produces two results. The $nn.index element provides information about the k nearest points of each observation. For example, in the first line, representing the first observation, the first nearest point is observation number 2, the second nearest point is observation number 3, and the third nearest point is observation number 4.
The second result, $nn.dist, gives the distance from each observation to its nearest neighbours using Euclidean distance. For example, for the second observation, the distance to the third observation is 1, while the distance to the fourth observation is 3.162278.
Next, we need to calculate the distance from each point to its kth nearest neighbour. Since we use k = 2, we need the distance to the second nearest point. We can look again at the $nn.index result above.
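For reference, these kth-nearest distances can also be read directly from the second column of the $nn.dist matrix computed above (a small sketch using the objects we already have); below we also walk through them point by point.
kth_distances <- example_distance$nn.dist[, 2] # distance from each point to its 2nd nearest neighbour
names(kth_distances) <- df$name
kth_distances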
The second nearest point of Apple is Cucumber, and the distance between those points is 1.414214.
apple_kth_distance <- 1.414214
apple_kth_distance
## [1] 1.414214
The second nearest point of Banana is Cucumber. From the $nn.dist result, we can see that Banana's distances to Apple and to Cucumber are the same, so we can use either distance.
banana_kth_distance <- 1
banana_kth_distance
## [1] 1
The second nearest point of Cucumber is Apple.
cucumber_kth_distance <- 1.414214
cucumber_kth_distance
## [1] 1.414214
The second nearest point of Durian is Apple.
durian_kth_distance <- 3.000000
durian_kth_distance
## [1] 3
Local reachability density measures how easily a point can be reached from its neighbours. It is calculated as the parameter k divided by the sum of the reachability distances to all k nearest neighbours. The reachability distance of a point with respect to one of its neighbours is the maximum of the distance between the two points and that neighbour's distance to its own kth nearest neighbour. To see this clearly, we can calculate the reachability distances for point A, Apple.
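To make these definitions concrete, here is a minimal sketch (using the small fruit data frame from earlier and helper names of our own choosing) that writes the kth distance, the reachability distance, and the local reachability density directly in R; the manual walkthrough below computes the same quantities value by value.
k <- 2
dist_mat <- as.matrix(dist(df %>% select(x, y))) # pairwise Euclidean distances
kth_dist <- function(i) sort(dist_mat[i, ])[k + 1] # distance to the kth nearest neighbour (position 1 is the point itself)
reach_dist <- function(i, j) max(dist_mat[i, j], kth_dist(j)) # reachability distance of i with respect to neighbour j
lrd <- function(i) { # local reachability density: k divided by the summed reachability distances
  neighbours <- order(dist_mat[i, ])[2:(k + 1)]
  k / sum(sapply(neighbours, function(j) reach_dist(i, j)))
}
lrd(1) # local reachability density of Apple, matching the lrda value computed below up to rounding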
The nearest points of Apple are Banana and Cucumber, so we need to find the reachability distance with respect to both Banana and Cucumber.
The distance from Apple to Banana is 1, while Banana itself has a kth nearest neighbour: the second nearest neighbour of Banana is Cucumber, at a distance of 1. Hence we take the maximum of those two distances.
rd_ab <- max(1,1)
rd_ab
## [1] 1
Next, we can calculate the reachability distance with respect to Cucumber. The distance from Apple to Cucumber is 1.414214, and Cucumber's own second nearest point is Apple, also at a distance of 1.414214.
rd_ac <- max(1.414214, 1.414214)
rd_ac
## [1] 1.414214
Next, we can find the local reachability density of Apple. We know that k = 2.
k = 2
lrda <- k/ (rd_ab + rd_ac)
lrda
## [1] 0.828427
We can also calculate the local reachability density for each of the other observations.
Here is the calculation for Banana.
rd_ba <- max(1, 1.414214)
rd_bc <- max(1, 1.414214)
lrdb <- k/(rd_ba + rd_bc)
lrdb
## [1] 0.7071066
Here is the calculation for Cucumber
rd_cb <- max(1, 1)
rd_ca <- max(1.414214, 1.414214)
lrdc <- k/(rd_cb + rd_ca)
lrdc
## [1] 0.828427
And here is the calculation for Durian
rd_dc <- max(2.236068, 1.414214)
rd_da <- max(3.000000,1.414214)
lrdd <- k/(rd_dc + rd_da)
lrdd
## [1] 0.381966
The final LOF score of a point is the sum of the local reachability densities of its k nearest neighbours, multiplied by the sum of the point's reachability distances, divided by k times k. This is the same as taking the average ratio of each neighbour's local reachability density to the point's own local reachability density.
To calculate the Local Outlier Factor score for Apple, the calculation is as follows.
lof_apple <- ((lrdb + lrdc) * (rd_ab + rd_ac))/ (k * k)
lof_apple
## [1] 0.9267766
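Equivalently, the same score can be written as the average ratio of the neighbours' local reachability densities to the point's own density; a short check reusing the values computed above reproduces the Apple score up to rounding.
mean(c(lrdb, lrdc) / lrda) # average density of Apple's neighbours relative to Apple's own density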
We can also calculate the Local Outlier Factor score for each of the other observations.
LOF for Banana:
lof_banana <- ((lrda + lrdc) * (rd_ba + rd_bc))/ (k * k)
lof_banana
## [1] 1.171573
LOF for Cucumber:
lof_cucumber <- ((lrda + lrdb) * (rd_ca + rd_cb))/ (k * k)
lof_cucumber
## [1] 0.9267766
LOF for Durian:
lof_durian <- ((lrda + lrdc) * (rd_da + rd_dc))/ (k * k)
lof_durian
## [1] 2.16885
In reality, we cannot always calculate the LOF score manually since we often deal with many points. To reproduce the result above, we can use the lof function from the dbscan package. Remember that the distance calculation only works on numerical values, so we need to drop the categorical variable first.
library(dbscan)
example_lof <- lof(df %>% select(x, y), k = 2) # again use only the coordinate columns
example_lof
## [1] 0.9267767 1.1715729 0.9267767 2.1688504
df$lof <- example_lof
b <- ggplot(df, aes(x=x, y=y, color=name)) +
geom_point(aes(size=lof)) +
xlim(-1, 2) +
ylim(-1, 4) +
theme_ipsum()
ggplotly(b)
As the manual calculation above shows, the algorithm compares the density of a point to the density of its nearest neighbours. Since the LOF score is a ratio, a point whose density is similar to that of its neighbours gets a score close to 1. If the neighbours are less dense than the point itself, the point is still inside a cluster and is not an outlier; its LOF score will be less than 1. If the neighbours are denser than the point itself, the point is likely to be isolated and is an outlier; its LOF score will be greater than 1.
The Local Outlier Factor illustrated above can be used to detect outliers in a dataset, and it can also be used to detect anomalous conditions. Many companies deal with anomalous data, and anomaly detection systems are in high demand. Domain knowledge of each industry is needed to gain a deep understanding of what an anomaly means there. For example, in the manufacturing industry, anomaly detection is used for quality control; in the banking industry, an anomaly detection system is needed to distinguish genuine from fraudulent transactions.
The KNN distance used earlier is excellent for detecting global anomalies. To see the Local Outlier Factor algorithm in action, we will use a dataset consisting of many points. Besides plain outlier detection, we can also use the method to detect anomalies in a dataset; here, we will apply it to fraud detection in a financial dataset.
This dataset is a synthetic dataset generated with a simulator called PaySim. It contains financial transactions, including fraudulent observations. PaySim simulates mobile money transactions based on a sample of genuine transactions extracted from one month of financial logs of a mobile money service implemented in an African country. The original logs were provided by a multinational company, the provider of the mobile financial service, which is currently running in more than 14 countries worldwide. The source of the data can be found here.
fraud <- read.csv('fraud.csv')
str(fraud)
## 'data.frame': 6362620 obs. of 11 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : chr "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
Each feature of the data is described as follows:
step - maps a unit of time in the real world; in this case, 1 step is 1 hour. Total steps: 744 (30 days of simulation).
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction.
oldbalanceOrg - initial balance before the transaction.
newbalanceOrig - new balance after the transaction.
nameDest - customer who is the recipient of the transaction.
oldbalanceDest - initial balance of the recipient before the transaction.
newbalanceDest - new balance of the recipient after the transaction.
isFraud - transactions made by the fraudulent agents inside the simulation.
isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts.
Before we proceed with the algorithm, a few dataset cleaning steps need to be done. First, we need to convert variables with the wrong type: the type variable needs to be converted into a factor. We also need to drop the nameOrig and nameDest variables since they contain many unique values.
fraud_clean <- fraud %>%
mutate(type = as.factor(type),
isFraud = as.factor(isFraud)) %>%
select(-c(nameOrig, nameDest))
We can see the data once again.
str(fraud_clean)
## 'data.frame': 6362620 obs. of 9 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
The type of each variable is now correct, and we can proceed to the next analysis.
library(treemap)
transaction <- c("fraud","genuine")
value <- c(sum(fraud_clean$isFraud == 1), sum(fraud_clean$isFraud == 0))
percentage <- c(sum(fraud_clean$isFraud == 1)/length(fraud_clean$isFraud)*100,
sum(fraud_clean$isFraud == 0)/length(fraud_clean$isFraud)*100)
prop <- data.frame(transaction,value,percentage)
prop
The proportion of genuine transactions is far larger than that of fraudulent ones; fraudulent transactions make up less than 1% of all transactions.
total_trans <- fraud_clean %>%
group_by(type) %>%
count(type) %>%
arrange(desc(n))
library(highcharter)
trm <- treemap(total_trans, index = "type", vSize = "n", draw = FALSE) # trm is assumed to be a treemap object built from the transaction counts
hctreemap(trm, allowDrillToNode = TRUE) %>%
hc_title(text = "Number of Transactions Based on Transaction Type") %>%
hc_exporting(enabled = TRUE)
The majority of the transactions come from Cash out, with over two million transactions, followed by Payment, Cash in and Debit.
We can also visualize the distribution of fraud transactions based on each type.
total_fraud <- fraud_clean %>%
filter(isFraud == 1) %>%
select(type) %>%
group_by(type) %>%
count(type) %>%
arrange(desc(n))
trmf <- treemap(total_fraud, index = "type", vSize = "n", draw = FALSE) # trmf is assumed to be a treemap object built from the fraud transaction counts
hctreemap(trmf, allowDrillToNode = TRUE) %>%
hc_title(text = "Fraud Transactions Based on Transaction Type") %>%
hc_exporting(enabled = TRUE)
## Warning: 'hctreemap' is deprecated.
## Use 'data_to_hierarchical' instead.
## See help("Deprecated")
Here, we see that the fraud transactions come from only two transaction types: CASH_OUT and TRANSFER.
We will shrink the data to focus only on these two transaction types.
fraud_clean_real <- fraud_clean %>%
filter(type == "CASH_OUT" | type == "TRANSFER")
str(fraud_clean_real)
## 'data.frame': 2770409 obs. of 9 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 5 2 2 5 5 2 2 2 2 5 ...
## $ amount : num 181 181 229134 215310 311686 ...
## $ oldbalanceOrg : num 181 181 15325 705 10835 ...
## $ newbalanceOrig: num 0 0 0 0 0 ...
## $ oldbalanceDest: num 0 21182 5083 22425 6267 ...
## $ newbalanceDest: num 0 0 51513 0 2719173 ...
## $ isFraud : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
dim(fraud_clean_real)
## [1] 2770409 9
prop.table(table(fraud_clean_real$isFraud))*100
##
## 0 1
## 99.7035456 0.2964544
The proportion of fraud transactions is 0.3% after this second cleaning step, much lower than the 99.7% of genuine transactions. In the real world, we often face imbalanced datasets because fraudulent transactions are usually far less common than genuine ones.
In reality, a dataset obtained from an experiment or a real situation often does not have a labelled variable. The Local Outlier Factor algorithm assesses the data to detect which observations are likely to be outliers compared to the others. Before applying the algorithm, we need to scale the numerical variables. Scaling is vital since the calculation is based on distance, and the distances should be computed on variables at the same scale. We also need to drop the categorical variables.
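As a quick check of why scaling matters here, the raw columns live on very different scales, so unscaled Euclidean distances would be dominated by the monetary variables (a small sketch; the exact numbers depend on the data):
fraud_clean_real %>%
  select(step, amount, oldbalanceDest) %>%
  summarise(across(everything(), sd))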
fraud_scale <- fraud_clean_real %>% select(-c(isFraud, isFlaggedFraud, type))
fraud_scale <- as.data.frame(scale(fraud_scale))
head(fraud_scale)
After scaling the data frame, we can use the lof function to calculate the LOF score. The chosen k value determines how many neighbours of each point enter the calculation. Choosing a small k results in an algorithm that is sensitive to noise, while choosing a very large k means the algorithm will not recognize local anomalies. For this article, we will use k = 40; other choices of k could be used as well.
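Before committing to k = 40, one way to probe how sensitive the scores are to k is to compare a few values on a random subsample; a minimal sketch (the subsample size and the k values tried here are arbitrary choices):
set.seed(100)
fraud_sample <- fraud_scale[sample(nrow(fraud_scale), 50000), ]
for (k_try in c(10, 20, 40)) {
  sample_scores <- lof(fraud_sample, k = k_try)
  cat("k =", k_try, "- share of LOF scores above 1:", mean(sample_scores > 1), "\n") # rule-of-thumb cut-off
}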
fraud_lof <- lof(fraud_scale, k = 40)
We then store the result in the original cleaned dataset for visualization.
fraud_clean_real$lof <- fraud_lof
head(fraud_clean_real)
To gain more understanding of the data, we will look at the distribution of fraud and genuine transactions. Since the dataset consists of more than two variables, we can use PCA and plot the first two dimensions.
library(FactoMineR)
library(factoextra)
fraud_pca <- PCA(fraud_scale, scale.unit = F, ncp = 6, graph = F)
summary(fraud_pca)
##
## Call:
## PCA(X = fraud_scale, scale.unit = F, ncp = 6, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.238 1.788 0.997 0.756 0.213 0.008
## % of var. 37.301 29.804 16.624 12.595 3.544 0.132
## Cumulative % of var. 37.301 67.106 83.730 96.324 99.868 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 1.851 | -0.785 0.000 0.180 | -0.182 0.000 0.010 | -1.666
## 2 | 1.850 | -0.782 0.000 0.179 | -0.182 0.000 0.010 | -1.666
## 3 | 1.811 | -0.666 0.000 0.135 | -0.121 0.000 0.004 | -1.660
## 4 | 1.818 | -0.679 0.000 0.139 | -0.163 0.000 0.008 | -1.661
## 5 | 1.764 | -0.252 0.000 0.020 | -0.152 0.000 0.007 | -1.684
## 6 | 1.809 | -0.686 0.000 0.144 | -0.102 0.000 0.003 | -1.665
## 7 | 1.832 | -0.738 0.000 0.162 | -0.174 0.000 0.009 | -1.666
## 8 | 2.004 | 0.220 0.000 0.012 | -0.252 0.000 0.016 | -1.737
## 9 | 1.838 | -0.767 0.000 0.174 | -0.124 0.000 0.005 | -1.665
## 10 | 1.830 | -0.742 0.000 0.164 | 0.121 0.000 0.004 | -1.662
## ctr cos2
## 1 0.000 0.810 |
## 2 0.000 0.811 |
## 3 0.000 0.841 |
## 4 0.000 0.834 |
## 5 0.000 0.912 |
## 6 0.000 0.847 |
## 7 0.000 0.827 |
## 8 0.000 0.751 |
## 9 0.000 0.820 |
## 10 0.000 0.825 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## step | 0.077 0.265 0.006 | -0.008 0.004 0.000 | 0.996 99.454
## amount | 0.633 17.879 0.400 | 0.099 0.549 0.010 | 0.023 0.052
## oldbalanceOrg | 0.051 0.114 0.003 | 0.944 49.815 0.891 | 0.012 0.015
## newbalanceOrig | 0.019 0.016 0.000 | 0.937 49.127 0.879 | -0.013 0.016
## oldbalanceDest | 0.929 38.531 0.862 | -0.074 0.308 0.006 | -0.050 0.252
## newbalanceDest | 0.983 43.195 0.967 | -0.060 0.199 0.004 | -0.046 0.210
## cos2
## step 0.992 |
## amount 0.001 |
## oldbalanceOrg 0.000 |
## newbalanceOrig 0.000 |
## oldbalanceDest 0.003 |
## newbalanceDest 0.002 |
fviz_eig(fraud_pca, ncp = 6, addlabels = T, main = "Variance explained by each dimension")
The PCA result above shows that if we use the first two dimensions of the data, we still retain 67% of the variance of the original data. The first two dimensions, along with the LOF score and the fraud label, are stored in a new data frame.
fraud_a <- data.frame(fraud_pca$ind$coord[,1:3])
fraud_b <- cbind(fraud_a, fraud = fraud_clean_real$isFraud, lof_score = fraud_clean_real$lof)
fraud_lof_visual <- ggplot(fraud_b, aes(x=Dim.1 ,y=Dim.2, color=fraud)) +
geom_point(aes(size=lof_score)) +
ggtitle("LOF Score Distribution")+
theme_ipsum()
fraud_lof_visual
From the visualization above, we can see that genuine and fraudulent transactions follow different patterns. The higher the LOF score of a point, the bolder and more prominent its dot.
The rule of thumb for the LOF score says that a score above 1 is likely to indicate an outlier. However, the threshold can be adjusted to the distribution of the data. Let's look at the statistics of the LOF score first.
summary(fraud_b)
## Dim.1 Dim.2 Dim.3 fraud
## Min. : -0.78483 Min. : -8.2534 Min. :-6.76640 0:2762196
## 1st Qu.: -0.53961 1st Qu.: -0.2033 1st Qu.:-0.61967 1: 8213
## Median : -0.37178 Median : -0.1802 Median :-0.02065
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.03398 3rd Qu.: -0.0909 3rd Qu.: 0.65432
## Max. :111.61041 Max. :397.7378 Max. : 4.20721
## lof_score
## Min. :0.9414
## 1st Qu.:0.9970
## Median :1.0156
## Mean :1.0605
## 3rd Qu.:1.0587
## Max. :7.2294
The LOF score has a maximum of more than 7. If we include points that fall far from the bulk of the distribution, the plot will be hard to read. Hence, we will cap the plotted LOF score at 1.75 and look at its distribution.
fraud_b %>%
filter(lof_score <= 1.75) %>%
ggplot( aes(x=lof_score)) +
geom_density( color="#e9ecef", fill = "#c90076", alpha=0.7) +
scale_fill_manual(values="#8fce00") +
xlab("LOF Score")+
ggtitle("LOF Score Distribution")+
theme_ipsum() +
labs(fill="")
We see above that many points have a LOF score above 1. To classify points as outliers, we can set the threshold higher.
One method to determine the threshold is to calculate a quantile. Here, we will treat the lowest 90% of scores as normal points, while the top 10% are considered outliers. The threshold proportion can be adjusted depending on the business case; if the user wishes to be more cautious with the LOF score, the threshold can be set higher.
quantile(fraud_b$lof_score, probs = c(0, 0.9))
## 0% 90%
## 0.9414236 1.1599704
Ninety percent of the LOF scores fall below 1.1599704. We will use this value as the threshold: if a point's score is above it, we categorize that point as an outlier.
fraud_b <- fraud_b %>%
mutate(outlier = ifelse(lof_score > 1.1599704, 1, 0))
We can once again visualize the distribution of outliers across all observations.
fraud_lof_visual_b <- ggplot(fraud_b, aes(x=Dim.1 ,y=Dim.2, color=outlier)) +
geom_point() +
ggtitle("LOF Score Distribution")+
theme_ipsum()
fraud_lof_visual_b
The visualization above shows that there are outliers among both fraudulent and genuine transactions. The user needs to investigate these observations and evaluate them in the system. By using the LOF algorithm, users can catch unwanted transactions early; preventive action is a must when dealing with fraudulent transactions. High LOF scores on genuine transactions should be interpreted with more domain background of the financial case; for example, there might simply have been an unusually large transaction at that time that was nevertheless made by a real person.
Many outlier or anomaly detection cases do not come with a labelled variable. If the dataset does provide a labelled target variable, it can be used to assess detection quality. The Local Outlier Factor assigns its unusualness score without knowing the actual condition of an observation; in this case, it does not know whether a data point is actually fraud or not. The Local Outlier Factor is an unsupervised machine learning method that does not involve labelled variables, but the score it produces can serve as an additional point of view in a supervised machine learning algorithm.
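Since this particular dataset does carry the isFraud label, we can sketch a rough check of how the unsupervised flag lines up with it (an illustration rather than a full evaluation; the numbers depend on the chosen threshold):
conf_mat <- table(flagged = fraud_b$outlier, actual = fraud_b$fraud)
conf_mat
sum(fraud_b$outlier == 1 & fraud_b$fraud == "1") / sum(fraud_b$fraud == "1") # share of labelled fraud that gets flagged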
Some important caveats when predicting outlier or anomaly conditions are having too few cases, as in disease detection, or rapidly changing behaviour, as in fraudulent transactions. In situations like these, we cannot rely solely on unsupervised machine learning algorithms; more robust techniques need to be assessed to acquire a better result.
The Local Outlier Factor produces a number on a ratio scale. If the score is more than 1, it means the density of the neighbours is higher than that of the point. The interpretation of this ratio can be adjusted based on the business needs of the user rather than treating every score above 1 as an outlier. If the user wants to be more selective in detecting outlier observations, the threshold on the LOF score can be increased based on the business domain.
An outlier is an observation in a dataset that differs from the rest of the data. Several methods can be used to detect whether an observation is an outlier. Grubbs' test is a good tool for outlier detection, but it does not work on multivariate data, and real-world datasets often have many variables. The Local Outlier Factor handles the multivariate situation by comparing the density of an observation to the density of its k nearest neighbours. LOF needs k as the attribute for finding the nearest neighbours, and different values of k produce different LOF scores; the larger the k value used, the more stable the outlier detection. Although the algorithm is designed to detect outlier observations, it can also be used to detect anomaly situations. The fraud detection example here produced LOF results with 86% precision. Although LOF scores can feed into supervised machine learning problems, note that several aspects of the business case need attention, such as the degree of imbalance in the dataset.
Different values of k produce detections of different quality from LOF. To better understand the choice of k, the user can try many values of k and check whether the results converge at a specific k. LOF works with numerical variables since it relies on distances; to handle categorical features, different distance metrics may be needed.
Breunig, M. M., Kriegel, H. P., Ng, R. T., and Sander, J. (2000). LOF: identifying density-based local outliers. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 93-104). ACM.