PallaviSaitu_ANLY540

Objective

Project Objective: This project aims to detect credit card fraud by looking at patterns in past transaction data. It covers data import, data exploration, data manipulation, and modeling with logistic regression and a decision tree.

Introduction

According to an article by The Ascent, “47% of global credit card fraud occurs here in the United States” (How Do Credit Card Companies Spot Fraud?, Nov 20, 2018). There are so many scams and frauds nowadays, as technology evolves around us, that credit card fraud has to be taken seriously. Banks invest a lot of money and time in detecting fraud. Furthermore, the article states that “it’s estimated that losses from credit card fraud exceeded $24 billion in 2016.” The key way of detecting credit card fraud is to look for unusual transaction activity, such as the amount of money spent (in person or online). At the end of a fraud, someone innocent loses money, hence it is extremely critical to detect credit card fraud.

77% of credit card fraud is done by phone: people click on spam messages or links that capture their credit card information, or they receive spam calls in which the caller asks for credit card information under a pretext such as having won a prize, having payments due, or calling on behalf of a government organization. Moreover, there are various types of credit card fraud:

1. Application fraud (applying for a new credit card)
2. Electronic or manual credit card imprints (fraudulent transactions or fake cards made using credit card imprints)
3. Card not present fraud (knowing the account number and expiry date is enough to make a CNP claim)
4. Counterfeit card fraud (fake magnetic swipe cards holding your details)
5. Lost and stolen card fraud (using someone's card that was stolen or lost)
6. Card ID theft (details of the card are known by a criminal)
7. Assumed identity (false information used to obtain a credit card)
8. Mail non-receipt card fraud (criminals using your newly ordered or renewed card)
9. Doctored cards (the card strip will not work, but the card details are changed so they can be entered manually into payment systems)
10. Account takeover (obtaining relevant information and account documents)
11. Fake cards (creating fake cards using advanced technology)

Hence, it is necessary to detect credit card frauds for all of the above types.

Hypothesis / Problem Statement

Credit card fraud is detected by identifying malicious activity in the amount spent within a specific time, together with other user information variables.

Statistical Analysis Plan

Predictive analysis, fraud analysis, outlier models, custom rule management, global profiling, and mobile card controls are some of the ways to approach a credit card fraud solution. In this project, I’ve considered predictive analysis using training and test data, with a logistic regression model and a decision tree model.

Method - Data - Variables

The libraries used are loaded below. The data is imported using read.csv and the first 6 records are shown using the head function. The dependent variable is Class, which denotes whether the activity is fraud (Class=1) or not fraud (Class=0). The independent variables are V1-V28, Time and Amount.

#install.packages("pROC")
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

library(caTools)

## Warning: package 'caTools' was built under R version 3.5.2

library(pROC)

## Warning: package 'pROC' was built under R version 3.5.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

#install.packages("rpart.plot")
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.5.2

## Loading required package: rpart

## Warning: package 'rpart' was built under R version 3.5.2

# Load dataset 
project_dataset <- read.csv("/Users/pallavisaitu/Downloads/creditcard.csv")
# View first 6 records
head(project_dataset,6)

##   Time         V1          V2        V3         V4          V5          V6
## 1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5    2 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146
## 6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##            V7          V8         V9         V10        V11         V12
## 1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##          V13        V14        V15        V16         V17         V18
## 1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##           V19         V20          V21          V22         V23
## 1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802
## 3 -2.26185710  0.52497973  0.247998153  0.771679402  0.90941226
## 4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052
## 5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808
## 6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767
##           V24        V25        V26          V27         V28 Amount Class
## 1  0.06692807  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
## 2 -0.33984648  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
## 3 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
## 4 -1.17557533  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
## 5  0.14126698 -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
## 6 -0.37142658 -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0

# Data Exploration
dim(project_dataset)

## [1] 284807     31

table(project_dataset$Class)

## 
##      0      1 
## 284315    492

summary(project_dataset$Amount)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.60    22.00    88.35    77.17 25691.16

str(project_dataset)

## 'data.frame':    284807 obs. of  31 variables:
##  $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
##  $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
##  $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
##  $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
##  $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
##  $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
##  $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
##  $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
##  $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
##  $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
##  $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
##  $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
##  $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
##  $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
##  $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
##  $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
##  $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
##  $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
##  $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
##  $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
##  $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
##  $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
##  $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
##  $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
##  $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
##  $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
##  $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
##  $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
##  $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
##  $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
##  $ Class : int  0 0 0 0 0 0 0 0 0 0 ...

The dataset contains 31 variables: Time, Amount, Class and V1-V28. Class = 0 means no fraud and 1 means fraud. There are 284807 rows of data. All columns are numeric except Class, which is stored as an integer.
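To make the class imbalance concrete, the short sketch below recomputes the share of fraud cases from the two counts printed by table(project_dataset$Class) above. The variable names are hypothetical and the counts are typed in by hand rather than read from the CSV.

```r
# Class counts copied from the table() output in this report
class_counts <- c("0" = 284315, "1" = 492)

# Share of each class among all transactions
class_share <- class_counts / sum(class_counts)

# Percentage of transactions labeled as fraud (Class = 1)
fraud_pct <- 100 * class_share[["1"]]
print(round(fraud_pct, 3))
```

With the real data frame loaded, prop.table(table(project_dataset$Class)) would give the same proportions directly.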

plot(project_dataset$Amount)

The plot above shows the wide spread of Amount users have spent. As we can see in this plot, there are a few outliers here which could be detected as fraud later on in the models.
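One common convention for flagging outliers is Tukey's 1.5 x IQR rule. This is an assumption for illustration and is not a step the analysis in this report performs. The sketch below applies it to the quartiles printed by summary(project_dataset$Amount) above; the variable names are hypothetical.

```r
# Quartiles and maximum copied from the summary() output in this report
q1 <- 5.60
q3 <- 77.17
max_amount <- 25691.16

# Interquartile range and the upper fence under the 1.5 x IQR convention (assumed rule)
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# Is the largest Amount beyond the fence?
is_outlier <- max_amount > upper_fence
print(upper_fence)
print(is_outlier)
```

An Amount above the fence is only unusual, not necessarily fraudulent, which is why the models below use the other variables as well.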

Statistical Analysis Results


The Amount column is standardized with scale(), and the Time column (column 1) is dropped to create the subset used for the logistic regression model.

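# Standardize Amount to mean 0 and standard deviation 1 (as.numeric drops the matrix wrapper)
project_dataset$Amount <- as.numeric(scale(project_dataset$Amount))
# Drop the first column (Time) to build the modeling subset
NewData <- project_dataset[, -1]
head(NewData)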

##           V1          V2        V3         V4          V5          V6
## 1 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146
## 6 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##            V7          V8         V9         V10        V11         V12
## 1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##          V13        V14        V15        V16         V17         V18
## 1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##           V19         V20          V21          V22         V23
## 1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802
## 3 -2.26185710  0.52497973  0.247998153  0.771679402  0.90941226
## 4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052
## 5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808
## 6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767
##           V24        V25        V26          V27         V28      Amount
## 1  0.06692807  0.1285394 -0.1891148  0.133558377 -0.02105305  0.24496383
## 2 -0.33984648  0.1671704  0.1258945 -0.008983099  0.01472417 -0.34247394
## 3 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184  1.16068389
## 4 -1.17557533  0.6473760 -0.2219288  0.062722849  0.06145763  0.14053401
## 5  0.14126698 -0.2060096  0.5022922  0.219422230  0.21515315 -0.07340321
## 6 -0.37142658 -0.2327938  0.1059148  0.253844225  0.08108026 -0.33855582
##   Class
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
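As a small illustration of what scale() does to Amount, the sketch below standardizes the first six amounts from the head() output by hand and compares the result with scale(). The mean and standard deviation here come from these six values only, so the numbers will not match the scaled Amount column shown above, which uses all 284807 rows.

```r
# First six Amount values from the head() output in this report
amount <- c(149.62, 2.69, 378.66, 123.50, 69.99, 3.67)

# Manual standardization: subtract the mean, divide by the standard deviation
manual <- (amount - mean(amount)) / sd(amount)

# The same transformation via scale(), converted from a matrix to a plain vector
scaled <- as.numeric(scale(amount))

print(all.equal(manual, scaled))
```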

I have split the dataset into a training set and a test set. The split ratio is 0.80, meaning 80% of the data is training data and the other 20% is test data. I then check the dimensions of each set using the dim() function.

set.seed(123)
data_sample = sample.split(NewData$Class,SplitRatio=0.80)
train_data = subset(NewData,data_sample==TRUE)
test_data = subset(NewData,data_sample==FALSE)

# Training data
dim(train_data)

## [1] 227846     30

# Test data
dim(test_data)

## [1] 56961    30
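sample.split() is given the Class labels so that both sets keep roughly the same fraud ratio. The sketch below shows the same idea in base R on a hypothetical toy data frame, sampling 80% of the rows within each class. It is an illustration of a stratified split, not the internals of caTools.

```r
set.seed(123)

# Hypothetical toy data: 950 non-fraud rows and 50 fraud rows
toy <- data.frame(x = rnorm(1000), Class = c(rep(0, 950), rep(1, 50)))

# Within each class, draw 80% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$Class)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets keep the original 5% fraud share
print(c(train = mean(toy_train$Class), test = mean(toy_test$Class)))
```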

Here, I fit the first model. Logistic regression models the probability of a binary outcome such as pass/fail, positive/negative and, in our case, fraud/not-fraud. The call below fits the model on test_data, so the summary reflects the 56961 test rows; fitting on train_data and evaluating on test_data would give an out-of-sample assessment. The summary function shows the model summary results, and plot() produces the residuals vs fitted plot, normal Q-Q plot, scale-location plot and residuals vs leverage plot.

Logistic_Model=glm(Class~.,test_data,family=binomial())

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(Logistic_Model)

## 
## Call:
## glm(formula = Class ~ ., family = binomial(), data = test_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9019  -0.0254  -0.0156  -0.0078   4.0877  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -12.52800   10.30537  -1.216   0.2241  
## V1           -0.17299    1.27381  -0.136   0.8920  
## V2            1.44512    4.23062   0.342   0.7327  
## V3            0.17897    0.24058   0.744   0.4569  
## V4            3.13593    7.17768   0.437   0.6622  
## V5            1.49014    3.80369   0.392   0.6952  
## V6           -0.12428    0.22202  -0.560   0.5756  
## V7            1.40903    4.22644   0.333   0.7388  
## V8           -0.35254    0.17462  -2.019   0.0435 *
## V9            3.02176    8.67262   0.348   0.7275  
## V10          -2.89571    6.62383  -0.437   0.6620  
## V11          -0.09769    0.28270  -0.346   0.7297  
## V12           1.97992    6.56699   0.301   0.7630  
## V13          -0.71674    1.25649  -0.570   0.5684  
## V14           0.19316    3.28868   0.059   0.9532  
## V15           1.03868    2.89256   0.359   0.7195  
## V16          -2.98194    7.11391  -0.419   0.6751  
## V17          -1.81809    4.99764  -0.364   0.7160  
## V18           2.74772    8.13188   0.338   0.7354  
## V19          -1.63246    4.77228  -0.342   0.7323  
## V20          -0.69925    1.15114  -0.607   0.5436  
## V21          -0.45082    1.99182  -0.226   0.8209  
## V22          -1.40395    5.18980  -0.271   0.7868  
## V23           0.19026    0.61195   0.311   0.7559  
## V24          -0.12889    0.44701  -0.288   0.7731  
## V25          -0.57835    1.94988  -0.297   0.7668  
## V26           2.65938    9.34957   0.284   0.7761  
## V27          -0.45396    0.81502  -0.557   0.5775  
## V28          -0.06639    0.35730  -0.186   0.8526  
## Amount        0.22576    0.71892   0.314   0.7535  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1443.40  on 56960  degrees of freedom
## Residual deviance:  378.59  on 56931  degrees of freedom
## AIC: 438.59
## 
## Number of Fisher Scoring iterations: 17

plot(Logistic_Model)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
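For comparison, a minimal sketch of the usual train-then-test workflow is shown below on simulated data. The data, variable names and the 0.5 cutoff are all hypothetical, and the numbers it produces say nothing about the credit card data.

```r
set.seed(123)

# Simulated data: two predictors and a binary outcome (hypothetical stand-in)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(x1, x2, Class = rbinom(n, 1, plogis(-2 + 1.5 * x1 - x2)))

# Rows are already in random order, so the first 80% serve as training data
in_train  <- seq_len(n) <= 0.8 * n
toy_train <- toy[in_train, ]
toy_test  <- toy[!in_train, ]

# Fit on the training rows only
fit <- glm(Class ~ ., data = toy_train, family = binomial())

# Predicted probabilities for the held-out rows
p_test <- predict(fit, newdata = toy_test, type = "response")

# Confusion table at an assumed 0.5 cutoff
pred_class <- as.integer(p_test > 0.5)
print(table(actual = toy_test$Class, predicted = pred_class))
```

Applied to this report, the equivalent would be glm(Class ~ ., train_data, family = binomial()) followed by predict() on test_data.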

Here, I have used the ROC curve to assess the performance of the model. ROC stands for Receiver Operating Characteristic.

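# Predicted fraud probabilities on the test set (type = "response" returns probabilities)
lr.predict <- predict(Logistic_Model, test_data, type = "response")
# ROC curve and AUC from pROC
auc.gbm = roc(test_data$Class, lr.predict, plot = TRUE, col = "blue")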

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases
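The area under the ROC curve can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The sketch below computes it from ranks on a tiny made-up example. The function auc_rank is a hypothetical helper written for illustration and is not part of pROC.

```r
# Rank-based AUC (Mann-Whitney form); labels are 0/1, scores are numeric
auc_rank <- function(labels, scores) {
  r     <- rank(scores)            # average ranks are used for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and predicted probabilities
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)

auc_prob <- auc_rank(labels, scores)

# The same scores on the log-odds scale give the same AUC
auc_link <- auc_rank(labels, qlogis(scores))

print(c(auc_prob, auc_link))
```

Because the AUC depends only on the ordering of the scores, it is the same whether predictions are on the probability scale or the log-odds scale.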

The decision tree model is used here to learn the patterns of fraud activities in the dataset. It is fit and evaluated on the full project_dataset rather than on the training split.

decisionTree_model <- rpart(Class ~ . , project_dataset, method = 'class')
predicted_val <- predict(decisionTree_model, project_dataset, type = 'class')
probability <- predict(decisionTree_model, project_dataset, type = 'prob')
rpart.plot(decisionTree_model)
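With a class as rare as fraud, overall accuracy says little, so a confusion matrix on held-out rows is a more informative check of a tree. The sketch below shows one way to do this with rpart on simulated data. The data, the rule that generates the labels, and all variable names are hypothetical.

```r
library(rpart)
set.seed(123)

# Simulated data: the positive class follows a simple threshold rule (hypothetical)
n   <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Class <- factor(as.integer(toy$x1 > 1 & toy$x2 < 0))

# Rows are already in random order, so the first 80% serve as training data
in_train <- seq_len(n) <= 0.8 * n

# Fit the classification tree on the training rows only
tree_fit <- rpart(Class ~ ., data = toy[in_train, ], method = "class")

# Predict classes for the held-out rows and tabulate against the truth
pred     <- predict(tree_fit, newdata = toy[!in_train, ], type = "class")
conf_mat <- table(actual = toy$Class[!in_train], predicted = pred)

# Recall: share of true positives found; precision: share of flagged rows that are positive
recall    <- conf_mat["1", "1"] / sum(conf_mat["1", ])
precision <- conf_mat["1", "1"] / sum(conf_mat[, "1"])

print(conf_mat)
print(c(recall = recall, precision = precision))
```

In this report, the same table could be built from predicted_val and project_dataset$Class, although those predictions are made on the rows the tree was fit to.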

Interpret and Discuss


The models show that there are some fraud activities recorded in the dataset. These were detected based on Amount and V1-V28 (plus Time in the decision tree), with Class as the outcome. The variables V1-V28 are anonymized because they contain sensitive information about the users. The decision tree model showed the similarities and patterns of fraud (Class=1) and no-fraud (Class=0) activities.

The factors that play an important role in detecting fraud are V10, V14, V16 and V17. According to the decision tree, if V16 >= 2.8, V10 >= -1.8, V14 >= -4.7, V14 >= -8.1 and V17 >= -2.8, then the activity is predicted to be fraud; otherwise, if V14 >= -8.1 and V17 >= -2.8, the activity is also predicted to be fraud. The model detected 0.172% of activities as fraud, which is in line with the 492 fraud cases among the 284807 transactions in the data. Such a model would reduce the losses users face.

References

6 things to look for in a credit card fraud detection solution. (2019, July 30). Retrieved from https://www.worldpay.com/en-us/insights-hub/article/6-things-to-look-for-in-a-credit-card-fraud-detection-solution.

Hadkar, A., & Yewale, S. (2015, May 2). Online Credit Card Fraud Detection. Retrieved from https://www.ijream.org/papers/INJRV01I02004.pdf.

How Do Credit Card Companies Spot Fraud? (2018, November 20). Retrieved from https://www.fool.com/the-ascent/credit-cards/articles/how-do-credit-card-companies-spot-fraud/.

Miller, A. (2019, May 22). The Best Ways To Prevent Credit Card Fraud & Theft. Retrieved from https://upgradedpoints.com/how-to-prevent-credit-card-fraud/.

Bennett, M. (2018, September 28). 11 Common Types Of Credit Card Scams & Fraud. Retrieved from https://www.consumerprotect.com/crime-fraud/11-types-of-credit-card-fraud-scams/.

PallaviSaitu_ANLY540_FinalProject

Pallavi Saitu

11/28/2019