For this project, I chose a simulated data set created by data scientist Kevin Kelue and uploaded to Kaggle. The data set covers credit card fraud throughout the United States and includes the merchant, state, date and time, credit card number, city, gender, occupation, zip code, and names of the recipients. I cleaned the data by keeping only the rows I needed with filter() and dropping the columns I would not use with select() and a minus sign in front of each column name. I chose this data set because I recently got a credit card and was curious about how credit card fraud works, even if the data is simulated. I focused primarily on health and fitness purchases involved in fraud among women in Maryland and Virginia.
# Load libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggplot2)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(dplyr)
# Set the working directory and read the data
setwd("C:/Users/jfgam/Downloads/Data 101")
fraudtest <- read_csv("fraud test.csv")
New names:
• `` -> `...1`
Rows: 555719 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): trans_date_trans_time, merchant, category, first, last, gender, st...
dbl (11): ...1, cc_num, amt, zip, lat, long, city_pop, unix_time, merch_lat,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Filtering data
# Keep health and fitness transactions made by women in Maryland and Virginia
fraudtest1 <- fraudtest |>
  filter(gender == "F") |>
  filter(state %in% c("MD", "VA")) |>
  filter(category == "health_fitness")
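The filters above select by gender, state, and category but do not use the is_fraud column, which is dropped in the cleaning step. If the analysis is meant to cover only transactions flagged as fraud, one more condition could be added. This is a minimal sketch, assuming is_fraud is coded 1 for fraud and 0 otherwise (an assumption about the data set, not confirmed here). The tibble toy is a made-up stand-in for fraudtest so the block runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for fraudtest; all values are made up for illustration.
toy <- tibble(
  gender   = c("F", "F", "M", "F"),
  state    = c("MD", "VA", "MD", "NY"),
  category = c("health_fitness", "health_fitness", "health_fitness", "health_fitness"),
  amt      = c(20.5, 310.0, 45.0, 12.0),
  is_fraud = c(0, 1, 1, 0)
)

# Same three conditions as the original filter, plus the assumed fraud flag.
toy_fraud <- toy |>
  filter(
    gender == "F",
    state %in% c("MD", "VA"),
    category == "health_fitness",
    is_fraud == 1  # assumption: 1 marks a fraudulent transaction
  )

toy_fraud
```

Applying the same extra condition to fraudtest would change the row counts and every result that follows, so it is worth checking how is_fraud is coded before relying on it.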
Cleaning the dataset
# Remove variables that won't be used
cleantest <- fraudtest1 |>
  select(
    -trans_date_trans_time, -cc_num, -merchant, -street, -dob, -unix_time,
    -is_fraud, -zip, -city_pop, -trans_num, -merch_lat, -merch_long
  )
head(cleantest)
# A tibble: 6 × 11
   ...1 category         amt first    last  gender city  state   lat  long job
  <dbl> <chr>          <dbl> <chr>    <chr> <chr>  <chr> <chr> <dbl> <dbl> <chr>
1   259 health_fitness 44.9  Haley    Wagn… F      Anna… MD     39.0 -76.6 Acco…
2   330 health_fitness 14.6  Emily    Hall  F      Basye VA     38.8 -78.8 Engi…
3  1124 health_fitness 64.9  Carol    Dill… F      Whal… MD     38.4 -75.3 Regu…
4  1189 health_fitness 48.1  Margaret Gibs… F      Scot… MD     38.1 -76.3 Insu…
5  4751 health_fitness  2.59 Linda    Davis F      Gait… MD     39.2 -77.1 Clin…
6  5281 health_fitness 17.5  Alicia   Hawk… F      Harw… MD     38.9 -76.6 Quan…
Plot 1
# Create the box plot
p1 <- ggplot(cleantest, aes(x = state, y = amt, color = state)) +
  geom_boxplot() +
  labs(
    title = "Credit Card fraud amount in Maryland and Virginia",
    caption = "SIMULATED DATASET BY KEVIN KELUE",
    x = "States",
    y = "Amount in $"
  ) +
  theme_minimal()
p1
In the first box plot, both states show low median amounts with a number of high outliers. Virginia's fraud amounts were slightly higher, possibly because it is a slightly larger state. Overall, though, the amounts in the two states were about the same.
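The box plot reading can also be checked with numbers. This is a minimal sketch, assuming a tibble shaped like cleantest with state and amt columns. The tibble toy_clean and its values are made up so the block runs on its own; swapping in cleantest would give the real figures.

```r
library(dplyr)

# Hypothetical stand-in for cleantest; amounts are made up for illustration.
toy_clean <- tibble(
  state = c("MD", "MD", "MD", "VA", "VA", "VA"),
  amt   = c(10, 20, 300, 15, 25, 400)
)

# Per-state count, median (the line inside each box), mean, and maximum.
state_summary <- toy_clean |>
  group_by(state) |>
  summarise(
    n          = n(),
    median_amt = median(amt),
    mean_amt   = mean(amt),
    max_amt    = max(amt)
  )

state_summary
```

Comparing median_amt with mean_amt shows how far the high outliers pull the mean above the typical amount.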
In my final map visualization, I built a popup that displays personal information for each victim: their occupation, full name, and the city they were in. The legend shows the range of the amounts, which ran from $50 to $300. What I noticed from this map is that the higher transaction amounts tended to be in Virginia. Something I wish I had done differently is polish the popup information more. I struggled a lot with that despite the resources I had.
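The map code is not shown in this write-up, so what follows is only a sketch of how such a map could be built with leaflet, not the code that produced the map described here. It assumes columns named like those in cleantest (lat, long, amt, first, last, city, job). The data frame toy_map, its values, and the popup_text column are made up so the block runs on its own.

```r
library(leaflet)

# Hypothetical stand-in for cleantest; rows are made up for illustration.
toy_map <- data.frame(
  first = c("Ana", "Bea"),
  last  = c("Lee", "Kim"),
  city  = c("Annapolis", "Richmond"),
  job   = c("Accountant", "Engineer"),
  amt   = c(50, 300),
  lat   = c(38.98, 37.54),
  long  = c(-76.49, -77.44)
)

# Build one HTML popup string per row: name, occupation, city, amount.
toy_map$popup_text <- paste0(
  "<b>", toy_map$first, " ", toy_map$last, "</b><br>",
  "Occupation: ", toy_map$job, "<br>",
  "City: ", toy_map$city, "<br>",
  "Amount: $", sprintf("%.2f", toy_map$amt)
)

# Continuous palette that maps amounts to colors.
pal <- colorNumeric(palette = "YlOrRd", domain = toy_map$amt)

fraud_map <- leaflet(toy_map) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    color = ~pal(amt),
    radius = 6,
    popup = ~popup_text
  ) |>
  addLegend(
    position = "bottomright",
    pal = pal, values = ~amt,
    title = "Amount in $"
  )

fraud_map
```

Building the popup text as its own column first keeps the HTML in one place, which makes it easier to adjust the wording or line breaks without touching the map layers.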
My resources:
https://www.epirhandbook.com/en/new_pages/interactive_plots.html
Class notes