Project 1: Fraud Detection

2024-11-05

Getting Started Pt.1

This analysis focuses on understanding and detecting fraudulent transactions. The dataset (synthetic_fraud_data.csv) describes financial transactions through attributes chosen to support fraud detection.

The dataset includes the following key columns:

transaction_id: A unique identifier for each transaction.

customer_id: A unique identifier for the customer involved in the transaction.

card_number: The card number used for the transaction.

timestamp: The date and time when the transaction occurred.

Getting Started Pt.2

merchant_category: The category of the merchant where the transaction took place.

merchant_type: The type of merchant (e.g., online, in-person).

merchant: The name of the merchant.

amount: The monetary value of the transaction.

currency: The currency in which the transaction was made.

country: The country where the transaction occurred.

city: The city of the transaction.

city_size: The size of the city.

Getting Started Pt.3

card_type: The type of card used (e.g., credit, debit).

card_present: Indicates whether the card was present during the transaction (TRUE/FALSE).

device: The device used for the transaction (e.g., mobile, desktop).

channel: The channel through which the transaction was made (e.g., online, in-store).

device_fingerprint: A unique identifier for the device used.

ip_address: The IP address from which the transaction originated.

distance_from_home: The distance of the transaction location from the customer’s home address.

high_risk_merchant: Indicates whether the merchant is considered high-risk (TRUE/FALSE).

Getting Started Pt.4

transaction_hour: The hour when the transaction occurred.

weekend_transaction: Indicates whether the transaction took place on a weekend (TRUE/FALSE).

velocity_last_hour: The number of transactions made by the customer in the last hour.

is_fraud: A binary label indicating whether the transaction was fraudulent (TRUE/FALSE).

This dataset provides the necessary information to explore patterns and behaviors associated with fraudulent transactions.

The goal of this project is to perform exploratory data analysis, apply statistical procedures to identify significant factors related to fraud, and ultimately develop insights that can aid in fraud detection.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
data <- read.csv("/Users/solenne/Desktop/synthetic_fraud_data.csv")

colnames(data)

##  [1] "transaction_id"      "customer_id"         "card_number"        
##  [4] "timestamp"           "merchant_category"   "merchant_type"      
##  [7] "merchant"            "amount"              "currency"           
## [10] "country"             "city"                "city_size"          
## [13] "card_type"           "card_present"        "device"             
## [16] "channel"             "device_fingerprint"  "ip_address"         
## [19] "distance_from_home"  "high_risk_merchant"  "transaction_hour"   
## [22] "weekend_transaction" "velocity_last_hour"  "is_fraud"
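
Because is_fraud is the target label, one quick check after loading is the share of fraudulent transactions, since a heavily imbalanced label affects how later statistics should be read. The following is a minimal sketch on a small made-up data frame (toy_tx and its values are illustrative and not taken from the dataset); it assumes is_fraud was read as a logical column.

```r
# Toy stand-in for the real data; values are invented for illustration.
toy_tx <- data.frame(
  transaction_id = c("T1", "T2", "T3", "T4", "T5"),
  is_fraud       = c(FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Count of each label value.
fraud_counts <- table(toy_tx$is_fraud)

# The mean of a logical vector is the proportion of TRUE values.
fraud_rate <- mean(toy_tx$is_fraud, na.rm = TRUE)

print(fraud_counts)
print(fraud_rate)
```

On the real data, the same two calls would be applied to data$is_fraud.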

To understand “normal” spending patterns within each merchant_category, I calculate the mean and median transaction amount per category.

# Calculate the average and median spend by merchant category
category_stats <- data %>%
  group_by(merchant_category) %>%
  summarise(
    avg_amount = mean(amount, na.rm = TRUE),
    median_amount = median(amount, na.rm = TRUE)
  )

head(category_stats)

## # A tibble: 6 × 3
##   merchant_category avg_amount median_amount
##   <chr>                  <dbl>         <dbl>
## 1 Education             45792.         1204.
## 2 Entertainment         29249.          800.
## 3 Gas                   46644.         1190.
## 4 Grocery               36201.         1039.
## 5 Healthcare            45654.         1167.
## 6 Restaurant            26484.          702.
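
In the output above, every category's mean is more than 30 times its median, which points to a strongly right-skewed distribution of amount: a small number of very large values pull the mean up while leaving the median almost unchanged. A minimal sketch of this effect with made-up numbers (toy_amounts is hypothetical and not drawn from the dataset):

```r
# Nine ordinary amounts plus one extreme value; all numbers are invented.
toy_amounts <- c(700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 400000)

mean_amount   <- mean(toy_amounts)    # pulled up by the single extreme value
median_amount <- median(toy_amounts)  # middle of the sorted values

# Ratio of mean to median as a rough indicator of right skew.
skew_ratio <- mean_amount / median_amount

print(c(mean = mean_amount, median = median_amount, ratio = skew_ratio))
```

For this reason, the median is likely the more representative baseline for a “normal” amount in each category.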

I now join category_stats back to the original dataset so that each transaction carries the average and median spend for its category.

# Merge category stats with the original data
dataleft <- data %>%
  left_join(category_stats, by = "merchant_category")
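
One way the joined columns could be used (an assumption about a possible next step, not something the analysis has done yet) is to express each amount relative to its category median, so that unusually large transactions stand out regardless of category. The sketch below uses a small invented data frame; toy_tx, toy_stats, toy_joined, and amount_to_median are hypothetical names. It also checks that the join did not change the number of rows, which holds when merchant_category is unique in the stats table.

```r
library(dplyr)

# Toy transactions; values are invented for illustration.
toy_tx <- data.frame(
  merchant_category = c("Gas", "Gas", "Gas", "Grocery", "Grocery"),
  amount            = c(100, 200, 900, 50, 150)
)

# Per-category baseline, mirroring category_stats above.
toy_stats <- toy_tx %>%
  group_by(merchant_category) %>%
  summarise(
    avg_amount    = mean(amount, na.rm = TRUE),
    median_amount = median(amount, na.rm = TRUE)
  )

# Join the baseline back and compare each amount with its category median.
toy_joined <- toy_tx %>%
  left_join(toy_stats, by = "merchant_category") %>%
  mutate(amount_to_median = amount / median_amount)

# A left join on a unique key should keep the row count unchanged.
stopifnot(nrow(toy_joined) == nrow(toy_tx))

print(toy_joined)
```

A ratio well above 1 marks a transaction that is large for its category; any cutoff for "unusual" would need to be chosen from the real data.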