Project2

Project 2 - Credit Card Fraud (SIMULATION)

SOURCE: https://www.massdefense.com/credit-card-theft-in-massachusetts/

Introduction

In this project, I decided to go with a simulated data set created by data scientist Kevin Kelue and uploaded to Kaggle. The data set covers credit card transactions throughout the United States, with information on the merchant, state, date and time, credit card number, city, gender, occupation, zip codes, and names of the recipients. I cleaned the data by using filter() to keep only the rows I wanted and select() with a minus sign to drop the variables I wasn't going to use. I chose this data set because I recently got a credit card and was curious to know how credit card fraud works, even if it's simulated. I primarily focused on health and fitness purchases among women in Maryland and Virginia.

# Load Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggplot2)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(dplyr)
# Set working directory and read in the data
setwd("C:/Users/jfgam/Downloads/Data 101")
fraudtest <- read_csv("fraud test.csv")
New names:
• `` -> `...1`
Rows: 555719 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): trans_date_trans_time, merchant, category, first, last, gender, st...
dbl (11): ...1, cc_num, amt, zip, lat, long, city_pop, unix_time, merch_lat,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering data

# Keep health and fitness transactions by women in Maryland and Virginia

fraudtest1 <- fraudtest |>
  filter(gender == "F") |>
  filter(state %in% c("MD", "VA")) |>
  filter(category == "health_fitness")
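
The filters above keep every health and fitness transaction by women in Maryland and Virginia, whether or not it was flagged as fraud. A minimal sketch of restricting the rows to flagged transactions follows. It runs on a small made-up tibble (`toy` is hypothetical) and assumes the `is_fraud` column is coded 1 for fraud and 0 otherwise, which this report does not confirm.

```r
library(dplyr)

# Hypothetical toy data standing in for fraudtest; all values are made up.
toy <- tibble(
  gender   = c("F", "F", "M", "F"),
  state    = c("MD", "VA", "MD", "VA"),
  category = c("health_fitness", "health_fitness", "health_fitness", "travel"),
  amt      = c(44.9, 14.6, 20.0, 300.0),
  is_fraud = c(1, 0, 1, 1)  # assumption: 1 = fraud, 0 = not fraud
)

# Same three conditions as above, plus the fraud flag, in a single filter() call.
toy_fraud <- toy |>
  filter(
    gender == "F",
    state %in% c("MD", "VA"),
    category == "health_fitness",
    is_fraud == 1
  )

toy_fraud
```

Because `is_fraud` is later dropped with select(), a filter like this would need to run before the cleaning step.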

Cleaning the dataset

# Remove variables that won't be used
cleantest <- fraudtest1 |>
  select(
    -trans_date_trans_time, -cc_num, -merchant, -street, -dob, -unix_time,
    -is_fraud, -zip, -city_pop, -trans_num, -merch_lat, -merch_long
  )

head(cleantest)
# A tibble: 6 × 11
   ...1 category         amt first    last  gender city  state   lat  long job  
  <dbl> <chr>          <dbl> <chr>    <chr> <chr>  <chr> <chr> <dbl> <dbl> <chr>
1   259 health_fitness 44.9  Haley    Wagn… F      Anna… MD     39.0 -76.6 Acco…
2   330 health_fitness 14.6  Emily    Hall  F      Basye VA     38.8 -78.8 Engi…
3  1124 health_fitness 64.9  Carol    Dill… F      Whal… MD     38.4 -75.3 Regu…
4  1189 health_fitness 48.1  Margaret Gibs… F      Scot… MD     38.1 -76.3 Insu…
5  4751 health_fitness  2.59 Linda    Davis F      Gait… MD     39.2 -77.1 Clin…
6  5281 health_fitness 17.5  Alicia   Hawk… F      Harw… MD     38.9 -76.6 Quan…

Plot 1

# Create boxplot of amounts by state

p1 <- ggplot(cleantest, aes(x = state, y = amt, color = state)) +
  geom_boxplot() +
  labs(
    title = "Credit Card Fraud Amount in Maryland and Virginia",
    caption = "SIMULATED DATASET BY KEVIN KELUE",
    x = "State",
    y = "Amount in $"
  ) +
  theme_minimal()
p1

Map Visualization

leaflet() |>
  setView(lng = -76.623, lat = 38.4214, zoom = 6) |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    lat = ~lat,
    lng = ~long,
    data = cleantest,
    color = "red",
    radius = cleantest$amt*50,
    fillOpacity = .7)

Setup code for popups and colors

# Create popups
# Each bold tag is closed with </b>; "<\b>" would insert a backspace character instead

popupfraud <- paste0(
  "<b>Occupation: </b>", cleantest$job, "<br>",
  "<b>First Name: </b>", cleantest$first, "<br>",
  "<b>Last Name: </b>", cleantest$last, "<br>",
  "<b>City: </b>", cleantest$city, "<br>"
)

Recreate Map with popups and extra details

pal <- colorNumeric(
  palette = "YlOrRd",
  domain = cleantest$amt
)

leaflet() |>
  setView(lng = -76.623, lat = 38.4214, zoom = 6) |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    lat = ~lat,
    lng = ~long,
    data = cleantest,
    color = ~pal(amt),
    radius = cleantest$amt*30,
    fillOpacity = .7,
    popup = ~popupfraud
  ) |>
  addLegend(
    position = "topright",
    pal = pal,
    values = cleantest$amt,
    title = "Fraud Amount",
    opacity = .8
  )  

Conclusions

The box plot showed low typical amounts in each state, with both states having high outliers. Virginia had slightly higher amounts, which could be because it is a slightly larger state. Overall, though, the amounts in the two states were roughly the same.
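
One way to check this reading of the box plot numerically is to summarise the amounts by state. The sketch below uses a small made-up tibble (`toy` and its values are hypothetical, not taken from the data set); the same group_by() and summarise() calls could be applied to `cleantest`.

```r
library(dplyr)

# Hypothetical amounts; not taken from the data set.
toy <- tibble(
  state = c("MD", "MD", "MD", "VA", "VA", "VA"),
  amt   = c(10, 20, 90, 15, 25, 200)
)

# Count, median, mean, and maximum amount per state.
state_summary <- toy |>
  group_by(state) |>
  summarise(
    n          = n(),
    median_amt = median(amt),
    mean_amt   = mean(amt),
    max_amt    = max(amt)
  )

state_summary
```

The median is the line that geom_boxplot() draws inside each box, so it is the closest numerical match to what the plot shows, while the mean is pulled upward by high outliers.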

In my final map visualization, I built popups that display personal information for each victim: their occupation, full name, and city. The legend shows the range of the amounts, which ran from 50 to 300. What I noticed from this map is that the higher transaction amounts tended to be in Virginia. Something I wish I had done differently is polish the popup information more; I struggled a lot with that despite the resources I had.
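
A sketch of one way the popup text could be tidied: a small helper that closes each bold tag with `</b>` and also shows the transaction amount. The helper name `make_popup()` and the sample values are hypothetical, and the amount line is an addition that the popup in this report did not have.

```r
# Hypothetical helper: builds one HTML popup string per row.
# paste0() and formatC() are vectorised, so whole columns can be passed in.
make_popup <- function(first, last, job, city, amt) {
  paste0(
    "<b>Name:</b> ", first, " ", last, "<br>",
    "<b>Occupation:</b> ", job, "<br>",
    "<b>City:</b> ", city, "<br>",
    "<b>Amount:</b> $", formatC(amt, format = "f", digits = 2)
  )
}

# Made-up example row.
example_popup <- make_popup("Jane", "Doe", "Accountant", "Annapolis", 44.9)
example_popup
```

Passing `popup = make_popup(cleantest$first, cleantest$last, cleantest$job, cleantest$city, cleantest$amt)` to addCircles() would then attach one popup to each circle.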

My resources: https://www.epirhandbook.com/en/new_pages/interactive_plots.html and class notes.