Company XYZ is an e-commerce site that sells hand made clothes. I have to build a model that predicts whether a user has high probability of using the site to perform illegal activity or not.
We only have information about the user first transaction on the site and based on that we have to make your classification (“fraud/no fraud”).
For each user, determine her country based on the numeric IP address.
Build a model to predict whether an activity is fraudulent or not. Explain how different assumptions about the cost of false positives vs false negatives would impact the model.
Your boss is a bit worried about using a model she doesn’t understand for something as important as fraud detection. How would you explain her how the model is making the predictions? Not from a mathematical perspective (she couldn’t care less about that), but from a user perspective. What kinds of users are more likely to be classified as at risk? What are their characteristics?
user_id : Id of the user. Unique by user
signup_time : the time when the user created her account (GMT time)
purchase_time :the time when the user bought the item (GMT time)
purchase_value : the cost of the item purchased (USD)
device_id : the device id. You can assume that it is unique by device. I.e., transactions with the same device ID means that the same physical device was used to buy
source : user marketing channel: ads, SEO, Direct (i.e. came to the site by directly typing the site address on the browser).
browser : the browser used by the user.
sex : user sex: Male/Female .
age : user age .
ip_address : user numeric ip address .
class : this is what we are trying to predict: whether the activity was fraudulent (1) or not (0).
2.“IpAddress_to_Country” - mapping each numeric ip address to its country. For each country, it gives a range. If the numeric ip address falls within the range, then the ip address belongs to the corresponding country.
lower_bound_ip_address : the lower bound of the numeric ip address for that country
upper_bound_ip_address : the upper bound of the numeric ip address for that country
country : the corresponding country. If a user has an ip address whose value is within the upper and lower bound, then she is based in this country.
knitr::opts_chunk$set(echo = TRUE)
library("ggplot2")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("lubridate")
## Warning: package 'lubridate' was built under R version 3.5.3
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library("caret")
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
library("rpart")
## Warning: package 'rpart' was built under R version 3.5.3
library("rpart.plot")
## Warning: package 'rpart.plot' was built under R version 3.5.3
library("rattle")
## Warning: package 'rattle' was built under R version 3.5.3
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library("randomForest")
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library("gbm")
## Warning: package 'gbm' was built under R version 3.5.3
## Loaded gbm 2.1.5
fraud_data<-read.csv("D:/profile/documents/Fraud_Data.csv")
ip_add<-read.csv("D:/profile/documents/IpAddress_to_Country.csv")
head(fraud_data)
## user_id signup_time purchase_time purchase_value device_id
## 1 22058 2/24/2015 22:55 4/18/2015 2:47 34 QVPSPJUOCKZAR
## 2 333320 6/7/2015 20:39 6/8/2015 1:38 16 EOGFQPIZPYXFZ
## 3 1359 1/1/2015 18:52 1/1/2015 18:52 15 YSSKYOSJHPPLJ
## 4 150084 4/28/2015 21:13 5/4/2015 13:54 44 ATGTXKYKUDUQN
## 5 221365 7/21/2015 7:09 9/9/2015 18:40 39 NAUITBZFJKHWW
## 6 159135 5/21/2015 6:03 7/9/2015 8:05 42 ALEYXFXINSXLZ
## source browser sex age ip_address class
## 1 SEO Chrome M 39 732758369 0
## 2 Ads Chrome F 53 350311388 0
## 3 SEO Opera M 53 2621473820 1
## 4 SEO Safari M 41 3840542444 0
## 5 Ads Safari M 45 415583118 0
## 6 Ads Chrome M 18 2809315200 0
dim(fraud_data)
## [1] 151112 11
head(ip_add)
## lower_bound_ip_address upper_bound_ip_address country
## 1 16777216 16777471 Australia
## 2 16777472 16777727 China
## 3 16777728 16778239 China
## 4 16778240 16779263 Australia
## 5 16779264 16781311 China
## 6 16781312 16785407 Japan
dim(ip_add)
## [1] 138846 3
Ctreate a function which can map ip address from fraud_data to ip_add and tell which country user belongs to.
data_country<- rep(NA, nrow(fraud_data))
for (i in 1:nrow(fraud_data)) {
vec<- as.character(ip_add[fraud_data$ip_address[i] >= ip_add$lower_bound_ip_address & fraud_data$ip_address[i] <= ip_add$upper_bound_ip_address, "country"])
if(length(vec) == 1){data_country[i] = vec}
}
fraud_data$country <- data_country
head(fraud_data)
## user_id signup_time purchase_time purchase_value device_id
## 1 22058 2/24/2015 22:55 4/18/2015 2:47 34 QVPSPJUOCKZAR
## 2 333320 6/7/2015 20:39 6/8/2015 1:38 16 EOGFQPIZPYXFZ
## 3 1359 1/1/2015 18:52 1/1/2015 18:52 15 YSSKYOSJHPPLJ
## 4 150084 4/28/2015 21:13 5/4/2015 13:54 44 ATGTXKYKUDUQN
## 5 221365 7/21/2015 7:09 9/9/2015 18:40 39 NAUITBZFJKHWW
## 6 159135 5/21/2015 6:03 7/9/2015 8:05 42 ALEYXFXINSXLZ
## source browser sex age ip_address class country
## 1 SEO Chrome M 39 732758369 0 Japan
## 2 Ads Chrome F 53 350311388 0 United States
## 3 SEO Opera M 53 2621473820 1 United States
## 4 SEO Safari M 41 3840542444 0 <NA>
## 5 Ads Safari M 45 415583118 0 United States
## 6 Ads Chrome M 18 2809315200 0 Canada
fraud_data$signup_time<- as.POSIXct(fraud_data$signup_time, format = "%m/%d/%Y %H:%M", tz = "GMT")
fraud_data$purchase_time<- as.POSIXct(fraud_data$purchase_time, format = "%m/%d/%Y %H:%M", tz = "GMT")
summary(as.factor(fraud_data$country))
## United States China
## 58049 12038
## Japan United Kingdom
## 7306 4490
## Korea Republic of Germany
## 4162 3646
## France Canada
## 3161 2975
## Brazil Italy
## 2961 1944
## Australia Netherlands
## 1844 1680
## Russian Federation India
## 1616 1310
## Taiwan; Republic of China (ROC) Mexico
## 1237 1121
## Sweden Spain
## 1090 1027
## South Africa Switzerland
## 838 785
## Poland Argentina
## 729 661
## Indonesia Norway
## 649 609
## Colombia Turkey
## 602 568
## Viet Nam Romania
## 550 525
## Denmark Hong Kong
## 490 471
## Finland Austria
## 460 435
## Ukraine Chile
## 429 417
## Belgium Iran (ISLAMIC Republic Of)
## 409 389
## Egypt Czech Republic
## 359 349
## Thailand New Zealand
## 291 278
## Israel Saudi Arabia
## 272 264
## Venezuela Ireland
## 251 240
## European Union Greece
## 238 231
## Portugal Hungary
## 229 211
## Malaysia Singapore
## 210 208
## Pakistan Philippines
## 186 177
## Bulgaria Morocco
## 166 158
## Algeria Peru
## 122 119
## Tunisia United Arab Emirates
## 118 114
## Ecuador Lithuania
## 106 95
## Seychelles Kenya
## 95 93
## Kazakhstan Costa Rica
## 92 90
## Kuwait Slovenia
## 90 87
## Slovakia (SLOVAK Republic) Uruguay
## 86 80
## Croatia (LOCAL Name: Hrvatska) Belarus
## 79 72
## Luxembourg Serbia
## 72 69
## Nigeria Latvia
## 67 64
## Panama Bolivia
## 62 53
## Dominican Republic Cyprus
## 51 43
## Estonia Oman
## 42 41
## Bangladesh Moldova Republic of
## 37 37
## Paraguay Georgia
## 35 32
## Sri Lanka Bosnia and Herzegowina
## 31 30
## Puerto Rico Jordan
## 30 28
## Lebanon El Salvador
## 28 25
## Qatar Sudan
## 25 25
## Angola Macedonia
## 24 24
## Syrian Arab Republic Azerbaijan
## 24 23
## Namibia Malta
## 23 22
## (Other) NA's
## 550 21966
summary(fraud_data)
## user_id signup_time
## Min. : 2 Min. :2015-01-01 00:00:00
## 1st Qu.:100643 1st Qu.:2015-02-18 09:52:45
## Median :199958 Median :2015-04-19 04:41:00
## Mean :200171 Mean :2015-04-20 00:55:40
## 3rd Qu.:300054 3rd Qu.:2015-06-18 14:47:15
## Max. :400000 Max. :2015-08-18 04:40:00
##
## purchase_time purchase_value device_id
## Min. :2015-01-01 00:00:00 Min. : 9.00 CQTUVBYIWWWBC: 20
## 1st Qu.:2015-04-18 14:41:00 1st Qu.: 22.00 EQYVNEGOFLAWK: 20
## Median :2015-06-18 13:45:30 Median : 35.00 ITUMJCKWEYNDD: 20
## Mean :2015-06-16 02:56:09 Mean : 36.94 KIPFSCNUGOLDP: 20
## 3rd Qu.:2015-08-17 18:48:00 3rd Qu.: 49.00 NGQCKIADMZORL: 20
## Max. :2015-12-16 02:56:00 Max. :154.00 ZUSVMDEZRBDTX: 20
## (Other) :150992
## source browser sex age
## Ads :59881 Chrome :61432 F:62819 Min. :18.00
## Direct:30616 FireFox:24610 M:88293 1st Qu.:27.00
## SEO :60615 IE :36727 Median :33.00
## Opera : 3676 Mean :33.14
## Safari :24667 3rd Qu.:39.00
## Max. :76.00
##
## ip_address class country
## Min. :5.209e+04 Min. :0.00000 Length:151112
## 1st Qu.:1.086e+09 1st Qu.:0.00000 Class :character
## Median :2.155e+09 Median :0.00000 Mode :character
## Mean :2.152e+09 Mean :0.09365
## 3rd Qu.:3.243e+09 3rd Qu.:0.00000
## Max. :4.295e+09 Max. :1.00000
##
summary(ip_add)
## lower_bound_ip_address upper_bound_ip_address country
## Min. :1.678e+07 Min. :1.678e+07 United States :46868
## 1st Qu.:1.920e+09 1st Qu.:1.920e+09 Canada : 6989
## Median :3.231e+09 Median :3.231e+09 Russian Federation: 6739
## Mean :2.725e+09 Mean :2.725e+09 Australia : 6316
## 3rd Qu.:3.350e+09 3rd Qu.:3.350e+09 Germany : 5999
## Max. :3.758e+09 Max. :3.758e+09 United Kingdom : 5401
## (Other) :60534
sum(is.na(fraud_data$country))
## [1] 21966
We can see in fraud_data minimum value of ip_address is lower than the min of lower_bound_ip_address and max value of ip_address is higher than the max value of upper_bound_ip_address. and we have 21966 NAs in country of fraud_data.
We will explore the relation between gender and class to see if fraud has something to with gender.
counts <- table(fraud_data$class, fraud_data$sex)
barplot(counts, main="Gender and Class",
xlab="Gender", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
Here we see m ales make up more of the customer base of the company and thus are also involved in more of the fraud cases. It doesn’t look like one gender is more likely to participate in fraud though, just that one gender tends to use the site more in general.
To detect Fraud few points come to our observation like how much time difference is there between signup and purchase, then if divice id is unique etc.. We will create few new variables :
fraud_data$diff_time <- as.numeric(difftime( fraud_data$purchase_time , fraud_data$signup_time, units = "secs"))
plot diff_time vs class
fraud_data$class<- as.factor(fraud_data$class)
ggplot(fraud_data, aes(x = class, y = diff_time))+ geom_boxplot()
fraud_data <- fraud_data %>%
group_by(device_id) %>%
mutate (device_id_count = n())
table( fraud_data$device_id_count,fraud_data$class)
##
## 0 1
## 1 127771 4010
## 2 8214 2440
## 3 204 66
## 4 6 10
## 5 13 52
## 6 30 144
## 7 49 301
## 8 69 483
## 9 78 624
## 10 93 827
## 11 102 1009
## 12 88 992
## 13 65 767
## 14 58 740
## 15 47 568
## 16 38 538
## 17 14 207
## 18 11 169
## 19 5 90
## 20 6 114
fraud_data<- fraud_data %>%
group_by(ip_address) %>%
mutate(ip_address_count = n())
fraud_data$signup_weekday <- format(fraud_data$signup_time, "%A")
fraud_data$purchase_weekday <- format(fraud_data$purchase_time, "%A" )
fraud_data$signup_weekyear <- as.numeric(format(fraud_data$signup_time, "%U"))
fraud_data$purchase_weekyear <- as.numeric(format(fraud_data$purchase_time, "%U" ))
Now let us drop some unwanted variables.
fraud_data1<- fraud_data[,-c(1,2,3,5)]
fraud_data1$country[is.na(fraud_data1$country)]="Not_found"
fraud_data1$country = ifelse(fraud_data1$country %in% names(sort(table(fraud_data1$country),decreasing = TRUE))[51:length(unique(fraud_data1$country))],"Other", as.character(fraud_data1$country))
fraud_data1$class<-as.factor(fraud_data1$class)
fraud_data1$country<- as.factor(fraud_data1$country)
fraud_data1$signup_weekday<-as.factor(fraud_data1$signup_weekday)
fraud_data1$purchase_weekday<- as.factor(fraud_data1$purchase_weekday)
fraud_data1<- fraud_data1[complete.cases(fraud_data1),]
fraud_data1$country[is.na(fraud_data1$country)]="Not_found"
set.seed(12345)
inTrain<-createDataPartition(fraud_data1$class, p = 0.7, list = FALSE)
trainData<-fraud_data1[inTrain , ]
testData<-fraud_data1[-inTrain , ]
dim(trainData)
## [1] 105779 15
dim(testData)
## [1] 45333 15
NZV <- nearZeroVar(trainData)
trainData1 <- trainData[, -NZV]
testData1 <- testData[, -NZV]
dim(trainData1)
## [1] 105779 14
dim(testData1)
## [1] 45333 14
We will predict outome for our model by using 3 different techniques: 1. Classification Trees 2. Random Forest 3.Generalized Boosted Models
set.seed(11111)
model_tree<-rpart(class ~ . , data = trainData1, method = "class")
fancyRpartPlot(model_tree)
Now using this model we will see how this is performing on our Test Data .
predict_model_tree<- predict(model_tree, testData1, type = "class")
cm_tree<- confusionMatrix(predict_model_tree, testData$class )
cm_tree$table
## Reference
## Prediction 0 1
## 0 41088 1995
## 1 0 2250
cm_tree$overall[1]
## Accuracy
## 0.9559923
So we can see the above model gives accuracy of 0.9285714 which gives us out-of-sample-error about .07.
model_rf<-randomForest(class ~ . , data = trainData1)
print(model_rf)
##
## Call:
## randomForest(formula = class ~ ., data = trainData1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.35%
## Confusion matrix:
## 0 1 class.error
## 0 95807 66 0.0006884107
## 1 4535 5371 0.4578033515
Now using this model we will see how this is performing on our Test Data .
predict_model_rf<- predict(model_rf, testData1, type = "class")
cm_rf<-confusionMatrix(predict_model_rf,testData1$class)
cm_rf$table
## Reference
## Prediction 0 1
## 0 41066 1982
## 1 22 2263
cm_rf$overall[1]
## Accuracy
## 0.9557938
So we can see the above model gives accuracy of 0.9557938 which gives us out-of-sample-error about .0442.
model_gbm<- train(class ~ . , data = trainData1, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 1), verbose = FALSE)
print(model_gbm)
## Stochastic Gradient Boosting
##
## 105779 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 84623, 84622, 84623, 84624, 84624
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9569196 0.6802609
## 1 100 0.9569291 0.6803869
## 1 150 0.9569196 0.6803353
## 2 50 0.9569007 0.6801950
## 2 100 0.9569196 0.6804093
## 2 150 0.9569196 0.6804093
## 3 50 0.9568913 0.6801432
## 3 100 0.9568440 0.6800314
## 3 150 0.9568346 0.6801264
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
Now using this model we will see how this is performing on our Test Data .
predict_model_gbm<- predict(model_gbm, testData1)
cm_gbm<-confusionMatrix(predict_model_gbm,testData1$class)
cm_gbm$table
## Reference
## Prediction 0 1
## 0 41088 1995
## 1 0 2250
cm_gbm$overall[1]
## Accuracy
## 0.9559923
So we can see the above model gives accuracy of 0.9559923 which gives us out-of-sample-error about .00440 .
We have observed out of three models our most accurate model come out to be GBM model, so we will use this model to detect the fraud.