Incognia Data Analysis Report

Introduction

Statistical analysis is an efficient way to gain insight into data and to frame the right questions. Using suitable tools also helps to obtain better results faster. This analysis therefore used the R language to load and transform the provided dataset and to create the visualizations in RStudio. The report presents the patterns found in the data and the conclusions drawn from them, some of which go beyond the observed data.

This report contains the parts of the R code that were considered important to explain how the analysis and charts were obtained. The full code will be sent together with this report.

Dataset description

Each event in the analyzed dataset is a login to the client’s app. The purpose of the analysis was to find patterns related to accounts and devices that indicate suspicious behavior, which is possibly associated with fraud. The dataset contains records from July 2021.

Initial dataset schema

event_id: identifier of the event.
event_timestamp: event datetime in milliseconds.
account_id: identifier of the account associated with the event.
device: identifier of the device that performed the operation.
distance_to_frequent_location: distance (in meters) from the device, at the time of the event, to one of the frequent locations related to the account.
device_age_days: number of days since the account was first associated with the device.
is_emulator: indicates whether the device is an emulator.
has_fake_location: indicates whether the device was using false locations at the time of the operation.
has_root_permissions: indicates whether the device has device administrator permissions.
app_is_from_official_store: indicates whether or not the app used to perform the operation came from an official store.

Dataset processing

Loading dataset

tb <- read.table("Dados/hugo_incognia_db_for_da_test.csv",
                 dec = ".",
                 sep = ",",
                 header = TRUE,
                 fileEncoding = "windows-1252")

Dataset general information

Dataset size is 444758 rows and 10 columns

Dataset transformation

Removing the id column, which is not necessary for the analysis

tb$event_id <- NULL

Shortening some column names

names(tb)[4] <- "distance_fl"
names(tb)[5] <- "device_age"
names(tb)[7] <- "fake_location"
names(tb)[8] <- "root_permissions"
names(tb)[9] <- "official_store"

Converting variable to integer

tb$distance_fl <- as.integer(tb$distance_fl)

Converting milliseconds to timestamp

tb$event_timestamp <- as.POSIXct(tb$event_timestamp / 1000, 
                                 origin = "1970-01-01", 
                                 tz = "UTC")
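As a quick sanity check of the conversion above, the following minimal sketch converts a single epoch value. The value `ms` is a hypothetical example, not a record from the dataset.

```r
# Hypothetical epoch value in milliseconds (not taken from the dataset).
ms <- 1625097600000

# as.POSIXct() interprets numbers as seconds since the origin,
# so the millisecond value is divided by 1000 first.
ts <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")

ts_text <- format(ts, "%Y-%m-%d %H:%M:%S")
print(ts_text)  # "2021-07-01 00:00:00"
```

Without the division by 1000, the same number would be read as seconds and land tens of thousands of years in the future.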

Creating the variables Event Hour, Date and Weekday

tb$event_hour <- format(tb$event_timestamp,"%H")
tb$event_hour <- as.numeric(tb$event_hour)
tb$event_date <- as.Date(tb$event_timestamp,format="%Y-%m-%d")
Sys.setlocale("LC_TIME","English")

## [1] "English_United States.1252"

tb$weekday <- weekdays(tb$event_date)
tb$wday <- as.integer(as.POSIXlt(tb$event_date)$wday)
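The derived time variables can be illustrated on one value. This is a minimal sketch using a hypothetical event time; the variable names mirror the columns created above.

```r
# Hypothetical event time in UTC (not taken from the dataset).
ts <- as.POSIXct("2021-07-04 23:15:00", tz = "UTC")

event_hour <- as.numeric(format(ts, "%H"))        # hour of the day, 0 to 23
event_date <- as.Date(ts)                         # calendar date of the event (UTC)
wday <- as.integer(as.POSIXlt(event_date)$wday)   # 0 = Sunday, ..., 6 = Saturday

print(c(event_hour, wday))  # 23 0
```

Note that `wday` is locale independent, while `weekdays()` returns names in the current `LC_TIME` locale, which is why the locale is set before calling it.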

The date range of the dataset is: 2021-07-01 to 2021-07-31

Missing values

Amelia::missmap(tb, main = "Missing Values")

There are 678 missing values, all of them in the location variable (distance_fl). This variable is crucial for determining risk, because a missing value means a login attempt without location information. The solution was therefore to replace “NA” with -1 instead of removing the rows.

tb <- tidyr::replace_na(tb, list(distance_fl = -1))
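The same check and replacement can be reproduced in base R on a small example. This is a sketch on a hypothetical data frame `toy`; only the column names come from the report.

```r
# Hypothetical data with two missing distances.
toy <- data.frame(
  distance_fl = c(12, NA, 0, NA),
  device_age  = c(3, 40, 0, 7)
)

# Count missing values per column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Replace missing distances with the sentinel -1 instead of dropping the rows.
toy$distance_fl[is.na(toy$distance_fl)] <- -1
```

Using a sentinel keeps the events in the dataset, but any later numeric summary of `distance_fl` has to treat -1 as "no location" rather than as a distance.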

Creating a Device Age Category

tb$age_category <- cut(tb$device_age, 
            breaks = c(0,1,7,30,365,Inf),
            labels = c("Day",
                       "Week",
                       "Month",
                       "Year",
                       "Year+"), right = FALSE)

Creating a Distance to frequent location Category

tb$distance_group <- cut(tb$distance_fl, 
                              breaks = c(-1,0,1,10,Inf),
                              labels = c("No Location",
                                         "Freq Location",
                                         "Near FL",
                                         "Far FL"), 
                         right = FALSE)
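Because both categorizations use `right = FALSE`, each interval includes its lower bound and excludes its upper bound. The sketch below applies the same breaks to a few hypothetical boundary values to show where they fall.

```r
# Hypothetical device ages (in days) placed on the interval boundaries.
age <- c(0, 1, 7, 30, 365)
age_category <- cut(age,
                    breaks = c(0, 1, 7, 30, 365, Inf),
                    labels = c("Day", "Week", "Month", "Year", "Year+"),
                    right = FALSE)
print(as.character(age_category))
# "Day" "Week" "Month" "Year" "Year+"

# Hypothetical distances (in meters); -1 is the sentinel for a missing location.
distance <- c(-1, 0, 1, 9, 10, 5000)
distance_group <- cut(distance,
                      breaks = c(-1, 0, 1, 10, Inf),
                      labels = c("No Location", "Freq Location",
                                 "Near FL", "Far FL"),
                      right = FALSE)
print(as.character(distance_group))
# "No Location" "Freq Location" "Near FL" "Near FL" "Far FL" "Far FL"
```

Since distance_fl was converted to integer earlier, any distance below one meter is stored as 0 and therefore counts as "Freq Location".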

Creating Risk Score variable

tb$score_risk <- 0
tb$score_risk <- tb$score_risk + 
  ifelse(tb$event_hour>=0 & tb$event_hour<=6,1,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$distance_group=="No Location",3,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$distance_group=="Far FL",2,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$distance_group=="Near FL",1,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$age_category=="Day",1,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$is_emulator=="true",3,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$fake_location=="true",3,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$root_permissions=="true",3,0)
tb$score_risk <- tb$score_risk + 
  ifelse(tb$official_store=="false",3,0)
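To make the scoring rules concrete, the sketch below applies the same weights to three hypothetical events in a single vectorized expression. The data frame `events` is illustrative only; the weights and conditions are those of the code above.

```r
# Three hypothetical events (not taken from the dataset).
events <- data.frame(
  event_hour       = c(3, 14, 10),
  distance_group   = c("Far FL", "Freq Location", "No Location"),
  age_category     = c("Day", "Year", "Week"),
  is_emulator      = c("false", "false", "true"),
  fake_location    = c("false", "false", "false"),
  root_permissions = c("true", "false", "false"),
  official_store   = c("true", "true", "false"),
  stringsAsFactors = FALSE
)

# Each logical condition is multiplied by its weight and the results are summed.
score_risk <- with(events,
  1 * (event_hour >= 0 & event_hour <= 6) +
  3 * (distance_group == "No Location") +
  2 * (distance_group == "Far FL") +
  1 * (distance_group == "Near FL") +
  1 * (age_category == "Day") +
  3 * (is_emulator == "true") +
  3 * (fake_location == "true") +
  3 * (root_permissions == "true") +
  3 * (official_store == "false")
)
print(score_risk)  # 7 0 9
```

The first event scores 1 (night hour) + 2 (far from frequent location) + 1 (new device) + 3 (root permissions) = 7, the second matches no rule, and the third scores 3 + 3 + 3 = 9.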

Creating Risk Level (category based on score)

tb$risk_level <- cut(tb$score_risk, 
                              breaks = c(0,2,3,Inf),
                              labels = c("Low",
                                         "Medium",
                                         "High"), 
                     right = FALSE)
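The effect of these breaks can be checked on a few hypothetical scores. With `right = FALSE` and integer scores, "Low" covers 0 and 1, "Medium" covers exactly 2, and "High" covers 3 and above.

```r
# Hypothetical scores covering each interval.
score <- c(0, 1, 2, 3, 9)

risk_level <- cut(score,
                  breaks = c(0, 2, 3, Inf),
                  labels = c("Low", "Medium", "High"),
                  right = FALSE)
print(as.character(risk_level))
# "Low" "Low" "Medium" "High" "High"
```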

Variables added to the dataset schema

event_hour: hour of the day in which the event occurred (0 to 23).
event_date: event date (yyyy-mm-dd).
weekday: day of the week.
wday: number of the day of the week (0 = Sunday).
age_category: variable device_age binned into categories (“Day”, “Week”, “Month”, “Year”, “Year+”).
distance_group: variable distance_fl binned into categories (“No Location”, “Freq Location”, “Near FL”, “Far FL”).
score_risk: variable that accumulates points whenever another variable indicates some risk.
risk_level: variable score_risk binned into categories (“Low”, “Medium”, “High”).

Graphical representation of dataset

Heatmap to analyze correlation between variables

This heatmap shows little or no correlation between the numerical variables.

M = cor(tb_num)
corrplot::corrplot(M, method = 'color', order = 'alphabet')
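The construction of `tb_num` is not shown in this report. One plausible way to obtain such a table, assuming it simply holds the numeric columns, is sketched below on a hypothetical data frame `toy`.

```r
# Hypothetical data frame mixing numeric and non-numeric columns.
toy <- data.frame(
  distance_fl = c(0, 5, 10, 200, 3),
  device_age  = c(1, 30, 400, 2, 90),
  event_hour  = c(4, 10, 11, 23, 15),
  weekday     = c("Sunday", "Monday", "Friday", "Sunday", "Tuesday")
)

# Keep only the numeric columns (assumption: this is what tb_num contains).
toy_num <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Pearson correlation matrix of the numeric columns.
M <- cor(toy_num)
print(round(M, 2))
```

The resulting matrix is symmetric with ones on the diagonal, which is the input a correlation heatmap expects.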

Barplot analysis

The bar charts below show the distribution of events by risk_level; most events in this dataset can be considered low risk.

Regarding distance_to_frequent_location, most events occurred at places that are not exactly at a frequent location.

Regarding the device age category, most events are associated with devices aged between one month and one year.

The Hours of Day chart shows clearly that few events occurred between midnight and 6 am.

Among the Boolean variables, very few events had values that indicate suspicious behavior of the device.

Boxplot analysis - Numeric Variables

The box plots below show the variation of the observed data for the numeric variables.

Boxplot analysis - Boolean Variables

The box plots below show how Device Age varies across the Boolean variables, to identify any relationship between them.

Moreover, the last box plot below relates Device Age to the Distance to Frequent Location category, showing that most devices with no location available are at the minimum age.

Scatter Plot analysis

The scatter plot below shows how scattered the data is. In addition, there is little relationship between the distance to the frequent location and device age.

Histogram of distribution

This histogram illustrates the distribution of events by the maximum age found for each device; devices with a lower age are far more common.

Area chart analysis

The area chart gives a good sense of the proportion of each risk level over the days on which the events occurred. According to the chart, high-risk events do not oscillate with the day of the week, while low- and medium-risk events oscillate on weekends.
Meanwhile, the stacked bar chart below confirms this, showing fewer events on Sundays.

Stacked bar chart analysis

Analysis Conclusion

The dataset analysis found patterns related to accounts and devices that indicate suspicious behavior, possibly associated with fraud. This report shows how these patterns occur and how often. Based on the patterns found, a variable was created that measures risk as a point score: the higher the score, the greater the risk. In addition, another variable categorizes the risk into 3 levels: Low, Medium and High. With these, Incognia will be able to improve the detection algorithm and communicate more efficiently with the financial client in order to avoid fraud.

Machine Learning Model

Objective

Create a machine learning model to predict whether a financial event is a fraud event based on patterns found in a dataset.

Preparing data for the model

Sample random rows in dataframe

Only 3000 rows were selected at random from the dataset, because larger samples require more computing power.

df = data.frame(tb)
tb_ml <- df[sample(nrow(df), 3000), ]

Defining a variable that flags an event as possible fraud when the Risk Score is 4 or greater.

tb_ml$is_fraud <- ifelse(tb_ml$score_risk>=4,1,0)

Selecting rows for training according to the variable is_fraud

index <- createDataPartition(tb_ml$is_fraud, 
                             p = 0.75, 
                             list = FALSE)

Setting training data as a subset

data_training <- tb_ml[index,]
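The idea behind a class-aware partition can be sketched in base R. This is not the caret implementation; it is a simplified illustration on a hypothetical data frame `toy`, in which 75% of the rows of each class go to training and the remaining rows form the test set.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical imbalanced data: 990 non-fraud rows and 10 fraud rows.
toy <- data.frame(
  x        = seq_len(1000),
  is_fraud = factor(rep(c(0, 1), times = c(990, 10)))
)

# Sample 75% of the row indices within each class, then combine them.
idx_by_class <- split(seq_len(nrow(toy)), toy$is_fraud)
index <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = ceiling(0.75 * length(i)))]
}), use.names = FALSE)

train <- toy[index, ]   # rows selected for training
test  <- toy[-index, ]  # complement of the training rows

print(table(train$is_fraud))
print(table(test$is_fraud))
```

Sampling within each class keeps the rare class represented in both subsets, which matters when only a handful of positive cases exist.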

Percentage comparison between training classes and original dataset

data_comparison

##      Training Original
## 0 0.996890271    0.997
## 1 0.003109729    0.003
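The code that builds `data_comparison` is not shown in this report. A table of this shape could be produced as sketched below, assuming it holds the class proportions of each set; the vectors `original` and `training` are hypothetical.

```r
# Hypothetical class labels for the full sample and for the training subset.
original <- factor(c(rep(0, 997), rep(1, 3)))
training <- factor(c(rep(0, 748), rep(1, 2)))

# Class proportions side by side, one row per class.
data_comparison <- cbind(
  Training = prop.table(table(training)),
  Original = prop.table(table(original))
)
print(data_comparison)
```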

Plot - Training vs original

ggplot(melt_data_comparison, aes(x = X1, y = value)) + 
  geom_bar( aes(fill = X2), 
            stat = "identity", 
            position = "dodge") + 
  ggtitle("Training vs original") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Building model version 1

model_v1 <- randomForest(is_fraud ~ ., data = data_training)

## 
## Call:
##  randomForest(formula = is_fraud ~ ., data = data_training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.31%
## Confusion matrix:
##      0 1 class.error
## 0 2244 0           0
## 1    7 0           1

Predicting the test dataset

predict_v1 <- predict(model_v1, data_test)

Confusion Matrix to calculate a cross-tabulation of observed and predicted classes

cm_v1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 747   2
##          1   0   0
##                                           
##                Accuracy : 0.9973          
##                  95% CI : (0.9904, 0.9997)
##     No Information Rate : 0.9973          
##     P-Value [Acc > NIR] : 0.6767          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.00000         
##             Specificity : 1.00000         
##          Pos Pred Value :     NaN         
##          Neg Pred Value : 0.99733         
##              Prevalence : 0.00267         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.00000         
##       Balanced Accuracy : 0.50000         
##                                           
##        'Positive' Class : 1               
##

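Precision, Recall and F1-Score, measures to evaluate the predictive model

As called here, posPredValue and sensitivity take the first factor level (0, not fraud) as the positive class, so the values below describe the non-fraud class and match the Neg Pred Value and Specificity of the confusion matrix above.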

y <- data_test$is_fraud
y_pred_v1 <- predict_v1

precision <- posPredValue(y_pred_v1, y)
recall <- sensitivity(y_pred_v1, y)
F1 <- (2 * precision * recall) / (precision + recall)

df2 <- data.frame(precision,recall,F1)
names(df2) <- c("Precision","Recall","F1")
df2

##   Precision Recall        F1
## 1 0.9973298      1 0.9986631

SMOTE algorithm for unbalanced classification problems

set.seed(9560)
data_training_bal <- SMOTE(is_fraud ~ ., 
                           data  = data_training)
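The core idea of SMOTE is to create synthetic minority-class rows by interpolating between a minority observation and one of its nearest minority neighbors. The sketch below shows only that interpolation step on two hypothetical points; it is not the implementation used by the `SMOTE` function called above.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical minority-class event and one of its nearest minority neighbors.
x  <- c(device_age = 2, distance_fl = 40)
nn <- c(device_age = 6, distance_fl = 10)

# A synthetic point is placed at a random position on the segment from x to nn.
u <- runif(1)
synthetic <- x + u * (nn - x)
print(synthetic)
```

Because the new point lies between two real minority observations, the balanced training set contains plausible but artificial fraud cases, which is why evaluation is still done on the untouched test set.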

Building model version 2

plot(model_v2)

Predicting the test dataset

predict_v2 <- predict(model_v2, data_test)

Confusion Matrix

cm_v2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 680   1
##          1  67   1
##                                           
##                Accuracy : 0.9092          
##                  95% CI : (0.8863, 0.9288)
##     No Information Rate : 0.9973          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0235          
##                                           
##  Mcnemar's Test P-Value : 3.211e-15       
##                                           
##             Sensitivity : 0.500000        
##             Specificity : 0.910308        
##          Pos Pred Value : 0.014706        
##          Neg Pred Value : 0.998532        
##              Prevalence : 0.002670        
##          Detection Rate : 0.001335        
##    Detection Prevalence : 0.090788        
##       Balanced Accuracy : 0.705154        
##                                           
##        'Positive' Class : 1               
##

Precision, Recall and F1-Score

y <- data_test$is_fraud
y_pred_v2 <- predict_v2

precision <- posPredValue(y_pred_v2, y)
recall <- sensitivity(y_pred_v2, y)
F1 <- (2 * precision * recall) / (precision + recall)

df3 <- data.frame(precision,recall,F1)
names(df3) <- c("Precision","Recall","F1")
df3

##   Precision    Recall       F1
## 1 0.9985316 0.9103079 0.952381
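The values printed above can be related to the confusion matrix of model version 2 by computing the metrics by hand. The sketch below uses the four counts from that matrix and computes precision, recall and F1 once with fraud (class 1) as the positive class and once with non-fraud (class 0) as the positive class.

```r
# Counts from the confusion matrix of model version 2 (positive class = 1).
tp <- 1    # predicted 1, reference 1
fp <- 67   # predicted 1, reference 0
fn <- 1    # predicted 0, reference 1
tn <- 680  # predicted 0, reference 0

# Metrics with fraud (class 1) as the positive class.
precision_1 <- tp / (tp + fp)
recall_1    <- tp / (tp + fn)
f1_1        <- 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Metrics with non-fraud (class 0) as the positive class.
precision_0 <- tn / (tn + fn)
recall_0    <- tn / (tn + fp)
f1_0        <- 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(round(c(precision_1, recall_1, f1_1), 4))  # 0.0147 0.5000 0.0286
print(round(c(precision_0, recall_0, f1_0), 4))  # 0.9985 0.9103 0.9524
```

The second set reproduces the printed Precision, Recall and F1, which indicates that those figures describe the non-fraud class; for the fraud class the precision is about 1.5%.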

Most important variables to predict

varImpPlot(model_v2)

Ranking of the most important variables

Building model version 3 with the most important variables

model_v3 <- randomForest(is_fraud ~ device_age + 
                           event_hour + 
                           distance_fl +
                           weekday +
                           official_store +
                           root_permissions, 
                          data = data_training_bal)

plot(model_v3)

Predicting the test dataset

predict_v3 <- predict(model_v3, data_test)

Confusion Matrix

cm_v3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 647   1
##          1 100   1
##                                           
##                Accuracy : 0.8652          
##                  95% CI : (0.8386, 0.8888)
##     No Information Rate : 0.9973          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0143          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.500000        
##             Specificity : 0.866131        
##          Pos Pred Value : 0.009901        
##          Neg Pred Value : 0.998457        
##              Prevalence : 0.002670        
##          Detection Rate : 0.001335        
##    Detection Prevalence : 0.134846        
##       Balanced Accuracy : 0.683066        
##                                           
##        'Positive' Class : 1               
##

Precision, Recall and F1-Score

y <- data_test$is_fraud
y_pred_v3 <- predict_v3

precision <- posPredValue(y_pred_v3, y)
recall <- sensitivity(y_pred_v3, y)
F1 <- (2 * precision * recall) / (precision + recall)
df4 <- data.frame(precision,recall,F1)
names(df4) <- c("Precision","Recall","F1")
df4

##   Precision    Recall        F1
## 1 0.9984568 0.8661312 0.9275986

Saving model file

rds <- paste("Dados/",
             format(Sys.time(), "%y%m%d%H%M%S"), 
             "_model_v3.rds",sep = "")
saveRDS(model_v3, file = rds)
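A saved model can later be restored with `readRDS`. The sketch below shows the round trip with a hypothetical stand-in object and a temporary directory, since the fitted model and the "Dados/" folder are not available here.

```r
# Hypothetical stand-in for the fitted model object.
model_stub <- list(name = "model_v3", ntree = 500)

# Timestamped file name, written to a temporary directory for this sketch.
path <- file.path(tempdir(),
                  paste0(format(Sys.time(), "%y%m%d%H%M%S"), "_model_v3.rds"))

saveRDS(model_stub, file = path)   # serialize a single R object to disk
restored <- readRDS(path)          # read it back into a new variable

print(identical(restored, model_stub))  # TRUE
```

The timestamp prefix avoids overwriting earlier versions, at the cost of having to look up the latest file name before loading.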

Predicting three random events

# Input
device_age <- c(0, 10, 100) 
event_hour <- c("4", "10", "11") 
distance_fl <- c(1000, 1, 0)
weekday <- c("Sunday","Sunday","Sunday")
official_store <- c("true","true","true")
root_permissions <- c("false","false","false")

Predictions results

pred_new_events <- predict(model_v3, new_events)
pred = data.frame(pred_new_events)
pred$is_fraud <- ifelse(pred_new_events==0,"No","Yes")
pred

##   pred_new_events is_fraud
## 1               1      Yes
## 2               0       No
## 3               1      Yes

Model Results

To create a machine learning model, it is important to have a sample in which each event is labeled as fraud or not fraud. With this information, it would be possible to train and test a model that predicts which future events will be fraud events.

However, the provided dataset did not contain fraud labels (fraud: yes/no), so a risk score was proposed based on the patterns found in the preliminary analysis. Then, under the hypothesis that a risk score of 4 or more indicates a fraud event, a machine learning model was developed to predict future events.

According to the model results, “device age” and “event hour” were the most important variables in the provided dataset, combined with “distance to frequent location” and “app is from the official store”. Note that each time the model is built, the ranking of the most important variables changes depending on the randomly sampled dataset of 3000 rows.

However, it is important to have more data about the event, such as “distance to frequent location is inside or outside a building”. In addition, having defined rules for “risk level” is the key to building an efficient algorithm and making the right recommendations to the client; for this analysis, the recommendation is based on the variables that were considered important to the case.

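To summarize, the reported precision and recall are high, but they describe the non-fraud class; for the fraud class, model version 3 reaches a sensitivity of 0.50 with a positive predictive value of about 0.01 on the test set, and the labels themselves derive from rules defined in the previous analysis. Therefore, it is crucial to have additional data, such as a fraud label (yes/no), to create a predictive model that is not biased by the author’s analysis.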

Additional Observations

There is a trend toward using business intelligence tools to improve visualizations and to interact with charts and tables. For this reason, a Microsoft Power BI dashboard was also developed, using the dataset generated in this analysis as its source. The Power BI file (pbix) will be sent as part of this analysis so that the interaction between charts can be explored.

Incognia Data Analysis

Hugo Daher

2022-03-04
