DATA CHALLENGE

Introduction to Task

Company XYZ is an e-commerce site that sells hand made clothes. I have to build a model that predicts whether a user has high probability of using the site to perform illegal activity or not.

We only have information about the user first transaction on the site and based on that we have to make your classification (“fraud/no fraud”).

Tasks

For each user, determine her country based on the numeric IP address.
Build a model to predict whether an activity is fraudulent or not. Explain how different assumptions about the cost of false positives vs false negatives would impact the model.
Your boss is a bit worried about using a model she doesn’t understand for something as important as fraud detection. How would you explain her how the model is making the predictions? Not from a mathematical perspective (she couldn’t care less about that), but from a user perspective. What kinds of users are more likely to be classified as at risk? What are their characteristics?

Data1

“Fraud_Data” - information about each user first transaction

Columns:

user_id : Id of the user. Unique by user
signup_time : the time when the user created her account (GMT time)
purchase_time :the time when the user bought the item (GMT time)
purchase_value : the cost of the item purchased (USD)
device_id : the device id. You can assume that it is unique by device. I.e., transactions with the same device ID means that the same physical device was used to buy
source : user marketing channel: ads, SEO, Direct (i.e. came to the site by directly typing the site address on the browser).
browser : the browser used by the user.
sex : user sex: Male/Female .
age : user age .
ip_address : user numeric ip address .
class : this is what we are trying to predict: whether the activity was fraudulent (1) or not (0).

Data2

2.“IpAddress_to_Country” - mapping each numeric ip address to its country. For each country, it gives a range. If the numeric ip address falls within the range, then the ip address belongs to the corresponding country.

Columns:

lower_bound_ip_address : the lower bound of the numeric ip address for that country
upper_bound_ip_address : the upper bound of the numeric ip address for that country
country : the corresponding country. If a user has an ip address whose value is within the upper and lower bound, then she is based in this country.

Download Data

knitr::opts_chunk$set(echo = TRUE)

library("ggplot2")

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("lubridate")

## Warning: package 'lubridate' was built under R version 3.5.3

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library("caret")

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

library("rpart")

## Warning: package 'rpart' was built under R version 3.5.3

library("rpart.plot")

## Warning: package 'rpart.plot' was built under R version 3.5.3

library("rattle")

## Warning: package 'rattle' was built under R version 3.5.3

## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library("randomForest")

## Warning: package 'randomForest' was built under R version 3.5.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library("gbm")

## Warning: package 'gbm' was built under R version 3.5.3

## Loaded gbm 2.1.5

fraud_data<-read.csv("D:/profile/documents/Fraud_Data.csv")

ip_add<-read.csv("D:/profile/documents/IpAddress_to_Country.csv")

head(fraud_data)

##   user_id     signup_time  purchase_time purchase_value     device_id
## 1   22058 2/24/2015 22:55 4/18/2015 2:47             34 QVPSPJUOCKZAR
## 2  333320  6/7/2015 20:39  6/8/2015 1:38             16 EOGFQPIZPYXFZ
## 3    1359  1/1/2015 18:52 1/1/2015 18:52             15 YSSKYOSJHPPLJ
## 4  150084 4/28/2015 21:13 5/4/2015 13:54             44 ATGTXKYKUDUQN
## 5  221365  7/21/2015 7:09 9/9/2015 18:40             39 NAUITBZFJKHWW
## 6  159135  5/21/2015 6:03  7/9/2015 8:05             42 ALEYXFXINSXLZ
##   source browser sex age ip_address class
## 1    SEO  Chrome   M  39  732758369     0
## 2    Ads  Chrome   F  53  350311388     0
## 3    SEO   Opera   M  53 2621473820     1
## 4    SEO  Safari   M  41 3840542444     0
## 5    Ads  Safari   M  45  415583118     0
## 6    Ads  Chrome   M  18 2809315200     0

dim(fraud_data)

## [1] 151112     11

head(ip_add)

##   lower_bound_ip_address upper_bound_ip_address   country
## 1               16777216               16777471 Australia
## 2               16777472               16777727     China
## 3               16777728               16778239     China
## 4               16778240               16779263 Australia
## 5               16779264               16781311     China
## 6               16781312               16785407     Japan

dim(ip_add)

## [1] 138846      3

Task1

Ctreate a function which can map ip address from fraud_data to ip_add and tell which country user belongs to.

data_country<- rep(NA, nrow(fraud_data))
for (i in 1:nrow(fraud_data)) {
  vec<- as.character(ip_add[fraud_data$ip_address[i] >= ip_add$lower_bound_ip_address  &                              fraud_data$ip_address[i] <= ip_add$upper_bound_ip_address,                                "country"])
  if(length(vec) == 1){data_country[i] = vec}
}

fraud_data$country <- data_country

head(fraud_data)

##   user_id     signup_time  purchase_time purchase_value     device_id
## 1   22058 2/24/2015 22:55 4/18/2015 2:47             34 QVPSPJUOCKZAR
## 2  333320  6/7/2015 20:39  6/8/2015 1:38             16 EOGFQPIZPYXFZ
## 3    1359  1/1/2015 18:52 1/1/2015 18:52             15 YSSKYOSJHPPLJ
## 4  150084 4/28/2015 21:13 5/4/2015 13:54             44 ATGTXKYKUDUQN
## 5  221365  7/21/2015 7:09 9/9/2015 18:40             39 NAUITBZFJKHWW
## 6  159135  5/21/2015 6:03  7/9/2015 8:05             42 ALEYXFXINSXLZ
##   source browser sex age ip_address class       country
## 1    SEO  Chrome   M  39  732758369     0         Japan
## 2    Ads  Chrome   F  53  350311388     0 United States
## 3    SEO   Opera   M  53 2621473820     1 United States
## 4    SEO  Safari   M  41 3840542444     0          <NA>
## 5    Ads  Safari   M  45  415583118     0 United States
## 6    Ads  Chrome   M  18 2809315200     0        Canada

fraud_data$signup_time<- as.POSIXct(fraud_data$signup_time, format = "%m/%d/%Y %H:%M", tz = "GMT")

fraud_data$purchase_time<- as.POSIXct(fraud_data$purchase_time, format = "%m/%d/%Y %H:%M", tz = "GMT")

summary(as.factor(fraud_data$country))

##                   United States                           China 
##                           58049                           12038 
##                           Japan                  United Kingdom 
##                            7306                            4490 
##               Korea Republic of                         Germany 
##                            4162                            3646 
##                          France                          Canada 
##                            3161                            2975 
##                          Brazil                           Italy 
##                            2961                            1944 
##                       Australia                     Netherlands 
##                            1844                            1680 
##              Russian Federation                           India 
##                            1616                            1310 
## Taiwan; Republic of China (ROC)                          Mexico 
##                            1237                            1121 
##                          Sweden                           Spain 
##                            1090                            1027 
##                    South Africa                     Switzerland 
##                             838                             785 
##                          Poland                       Argentina 
##                             729                             661 
##                       Indonesia                          Norway 
##                             649                             609 
##                        Colombia                          Turkey 
##                             602                             568 
##                        Viet Nam                         Romania 
##                             550                             525 
##                         Denmark                       Hong Kong 
##                             490                             471 
##                         Finland                         Austria 
##                             460                             435 
##                         Ukraine                           Chile 
##                             429                             417 
##                         Belgium      Iran (ISLAMIC Republic Of) 
##                             409                             389 
##                           Egypt                  Czech Republic 
##                             359                             349 
##                        Thailand                     New Zealand 
##                             291                             278 
##                          Israel                    Saudi Arabia 
##                             272                             264 
##                       Venezuela                         Ireland 
##                             251                             240 
##                  European Union                          Greece 
##                             238                             231 
##                        Portugal                         Hungary 
##                             229                             211 
##                        Malaysia                       Singapore 
##                             210                             208 
##                        Pakistan                     Philippines 
##                             186                             177 
##                        Bulgaria                         Morocco 
##                             166                             158 
##                         Algeria                            Peru 
##                             122                             119 
##                         Tunisia            United Arab Emirates 
##                             118                             114 
##                         Ecuador                       Lithuania 
##                             106                              95 
##                      Seychelles                           Kenya 
##                              95                              93 
##                      Kazakhstan                      Costa Rica 
##                              92                              90 
##                          Kuwait                        Slovenia 
##                              90                              87 
##      Slovakia (SLOVAK Republic)                         Uruguay 
##                              86                              80 
##  Croatia (LOCAL Name: Hrvatska)                         Belarus 
##                              79                              72 
##                      Luxembourg                          Serbia 
##                              72                              69 
##                         Nigeria                          Latvia 
##                              67                              64 
##                          Panama                         Bolivia 
##                              62                              53 
##              Dominican Republic                          Cyprus 
##                              51                              43 
##                         Estonia                            Oman 
##                              42                              41 
##                      Bangladesh             Moldova Republic of 
##                              37                              37 
##                        Paraguay                         Georgia 
##                              35                              32 
##                       Sri Lanka          Bosnia and Herzegowina 
##                              31                              30 
##                     Puerto Rico                          Jordan 
##                              30                              28 
##                         Lebanon                     El Salvador 
##                              28                              25 
##                           Qatar                           Sudan 
##                              25                              25 
##                          Angola                       Macedonia 
##                              24                              24 
##            Syrian Arab Republic                      Azerbaijan 
##                              24                              23 
##                         Namibia                           Malta 
##                              23                              22 
##                         (Other)                            NA's 
##                             550                           21966

Exploratory Analysis

summary(fraud_data)

##     user_id        signup_time                 
##  Min.   :     2   Min.   :2015-01-01 00:00:00  
##  1st Qu.:100643   1st Qu.:2015-02-18 09:52:45  
##  Median :199958   Median :2015-04-19 04:41:00  
##  Mean   :200171   Mean   :2015-04-20 00:55:40  
##  3rd Qu.:300054   3rd Qu.:2015-06-18 14:47:15  
##  Max.   :400000   Max.   :2015-08-18 04:40:00  
##                                                
##  purchase_time                 purchase_value           device_id     
##  Min.   :2015-01-01 00:00:00   Min.   :  9.00   CQTUVBYIWWWBC:    20  
##  1st Qu.:2015-04-18 14:41:00   1st Qu.: 22.00   EQYVNEGOFLAWK:    20  
##  Median :2015-06-18 13:45:30   Median : 35.00   ITUMJCKWEYNDD:    20  
##  Mean   :2015-06-16 02:56:09   Mean   : 36.94   KIPFSCNUGOLDP:    20  
##  3rd Qu.:2015-08-17 18:48:00   3rd Qu.: 49.00   NGQCKIADMZORL:    20  
##  Max.   :2015-12-16 02:56:00   Max.   :154.00   ZUSVMDEZRBDTX:    20  
##                                                 (Other)      :150992  
##     source         browser      sex            age       
##  Ads   :59881   Chrome :61432   F:62819   Min.   :18.00  
##  Direct:30616   FireFox:24610   M:88293   1st Qu.:27.00  
##  SEO   :60615   IE     :36727             Median :33.00  
##                 Opera  : 3676             Mean   :33.14  
##                 Safari :24667             3rd Qu.:39.00  
##                                           Max.   :76.00  
##                                                          
##    ip_address            class           country         
##  Min.   :5.209e+04   Min.   :0.00000   Length:151112     
##  1st Qu.:1.086e+09   1st Qu.:0.00000   Class :character  
##  Median :2.155e+09   Median :0.00000   Mode  :character  
##  Mean   :2.152e+09   Mean   :0.09365                     
##  3rd Qu.:3.243e+09   3rd Qu.:0.00000                     
##  Max.   :4.295e+09   Max.   :1.00000                     
##

summary(ip_add)

##  lower_bound_ip_address upper_bound_ip_address               country     
##  Min.   :1.678e+07      Min.   :1.678e+07      United States     :46868  
##  1st Qu.:1.920e+09      1st Qu.:1.920e+09      Canada            : 6989  
##  Median :3.231e+09      Median :3.231e+09      Russian Federation: 6739  
##  Mean   :2.725e+09      Mean   :2.725e+09      Australia         : 6316  
##  3rd Qu.:3.350e+09      3rd Qu.:3.350e+09      Germany           : 5999  
##  Max.   :3.758e+09      Max.   :3.758e+09      United Kingdom    : 5401  
##                                                (Other)           :60534

sum(is.na(fraud_data$country))

## [1] 21966

We can see in fraud_data minimum value of ip_address is lower than the min of lower_bound_ip_address and max value of ip_address is higher than the max value of upper_bound_ip_address. and we have 21966 NAs in country of fraud_data.

Including Plots

We will explore the relation between gender and class to see if fraud has something to with gender.

counts <- table(fraud_data$class, fraud_data$sex)
barplot(counts, main="Gender and Class",
  xlab="Gender", col=c("darkblue","red"),
  legend = rownames(counts), beside=TRUE)

Here we see m ales make up more of the customer base of the company and thus are also involved in more of the fraud cases. It doesn’t look like one gender is more likely to participate in fraud though, just that one gender tends to use the site more in general.

Feature Engineering

To detect Fraud few points come to our observation like how much time difference is there between signup and purchase, then if divice id is unique etc.. We will create few new variables :

Time difference between sign-up time and purchase time

fraud_data$diff_time <- as.numeric(difftime( fraud_data$purchase_time , fraud_data$signup_time, units = "secs"))

plot diff_time vs class

fraud_data$class<- as.factor(fraud_data$class)
ggplot(fraud_data, aes(x = class, y = diff_time))+ geom_boxplot()

If the device id is unique or certain users are sharing the same device (many different user ids using the same device could be an indicator of fake accounts).

fraud_data <- fraud_data %>%
              group_by(device_id) %>%
              mutate (device_id_count = n())

 table( fraud_data$device_id_count,fraud_data$class)

##     
##           0      1
##   1  127771   4010
##   2    8214   2440
##   3     204     66
##   4       6     10
##   5      13     52
##   6      30    144
##   7      49    301
##   8      69    483
##   9      78    624
##   10     93    827
##   11    102   1009
##   12     88    992
##   13     65    767
##   14     58    740
##   15     47    568
##   16     38    538
##   17     14    207
##   18     11    169
##   19      5     90
##   20      6    114

Many different users having the same ip address could be an indicator of fake accounts.

fraud_data<- fraud_data %>%
             group_by(ip_address) %>%
             mutate(ip_address_count = n())

Usual week of the year and day of the week from time variables.

fraud_data$signup_weekday <- format(fraud_data$signup_time, "%A")
fraud_data$purchase_weekday <- format(fraud_data$purchase_time, "%A" )
fraud_data$signup_weekyear <- as.numeric(format(fraud_data$signup_time, "%U"))
fraud_data$purchase_weekyear <- as.numeric(format(fraud_data$purchase_time, "%U" ))

Now let us drop some unwanted variables.

fraud_data1<- fraud_data[,-c(1,2,3,5)]

fraud_data1$country[is.na(fraud_data1$country)]="Not_found"

fraud_data1$country = ifelse(fraud_data1$country %in% names(sort(table(fraud_data1$country),decreasing = TRUE))[51:length(unique(fraud_data1$country))],"Other", as.character(fraud_data1$country))

fraud_data1$class<-as.factor(fraud_data1$class)

fraud_data1$country<- as.factor(fraud_data1$country)

fraud_data1$signup_weekday<-as.factor(fraud_data1$signup_weekday)

fraud_data1$purchase_weekday<- as.factor(fraud_data1$purchase_weekday)

Preparing Data For Prediction

fraud_data1<- fraud_data1[complete.cases(fraud_data1),]

fraud_data1$country[is.na(fraud_data1$country)]="Not_found"






set.seed(12345)

inTrain<-createDataPartition(fraud_data1$class, p = 0.7, list = FALSE)

trainData<-fraud_data1[inTrain , ]

testData<-fraud_data1[-inTrain , ]

dim(trainData)

## [1] 105779     15

dim(testData)

## [1] 45333    15

We can remove the variables that are near zero variance

NZV <- nearZeroVar(trainData)
trainData1 <- trainData[, -NZV]
testData1  <- testData[, -NZV]
dim(trainData1)

## [1] 105779     14

dim(testData1)

## [1] 45333    14

Model Building

We will predict outome for our model by using 3 different techniques: 1. Classification Trees 2. Random Forest 3.Generalized Boosted Models

Prediction with Classification Trees

set.seed(11111)
model_tree<-rpart(class ~ . , data = trainData1, method = "class")
fancyRpartPlot(model_tree)

Now using this model we will see how this is performing on our Test Data .

predict_model_tree<- predict(model_tree, testData1, type = "class")
cm_tree<- confusionMatrix(predict_model_tree, testData$class )
cm_tree$table

##           Reference
## Prediction     0     1
##          0 41088  1995
##          1     0  2250

cm_tree$overall[1]

##  Accuracy 
## 0.9559923

So we can see the above model gives accuracy of 0.9285714 which gives us out-of-sample-error about .07.

Prediction with Random Forest

model_rf<-randomForest(class ~ . , data = trainData1)
print(model_rf)

## 
## Call:
##  randomForest(formula = class ~ ., data = trainData1) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 4.35%
## Confusion matrix:
##       0    1  class.error
## 0 95807   66 0.0006884107
## 1  4535 5371 0.4578033515

Now using this model we will see how this is performing on our Test Data .

predict_model_rf<- predict(model_rf, testData1, type = "class")
cm_rf<-confusionMatrix(predict_model_rf,testData1$class)
cm_rf$table

##           Reference
## Prediction     0     1
##          0 41066  1982
##          1    22  2263

cm_rf$overall[1]

##  Accuracy 
## 0.9557938

So we can see the above model gives accuracy of 0.9557938 which gives us out-of-sample-error about .0442.

Prediction with Gradient Boosting Machine

model_gbm<- train(class ~ . , data = trainData1, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 1), verbose = FALSE)
print(model_gbm)

## Stochastic Gradient Boosting 
## 
## 105779 samples
##     13 predictor
##      2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 84623, 84622, 84623, 84624, 84624 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9569196  0.6802609
##   1                  100      0.9569291  0.6803869
##   1                  150      0.9569196  0.6803353
##   2                   50      0.9569007  0.6801950
##   2                  100      0.9569196  0.6804093
##   2                  150      0.9569196  0.6804093
##   3                   50      0.9568913  0.6801432
##   3                  100      0.9568440  0.6800314
##   3                  150      0.9568346  0.6801264
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.

Now using this model we will see how this is performing on our Test Data .

predict_model_gbm<- predict(model_gbm, testData1)
cm_gbm<-confusionMatrix(predict_model_gbm,testData1$class)
cm_gbm$table

##           Reference
## Prediction     0     1
##          0 41088  1995
##          1     0  2250

cm_gbm$overall[1]

##  Accuracy 
## 0.9559923

So we can see the above model gives accuracy of 0.9559923 which gives us out-of-sample-error about .00440 .

Conclusion

We have observed out of three models our most accurate model come out to be GBM model, so we will use this model to detect the fraud.

DATA CHALLENGE

priya malhotra

January 21, 2020

Introduction to Task

Tasks

Data1

Columns:

Data2

Columns:

Download Data

Task1

Exploratory Analysis

Including Plots

Feature Engineering

Preparing Data For Prediction

We can remove the variables that are near zero variance

Model Building

Prediction with Classification Trees

Prediction with Random Forest

Prediction with Gradient Boosting Machine

Conclusion