About me

Ambitious Data Analyst with a year of experience in data manipulation and visualization, currently enriching data science expertise through a master’s program. Aims to drive business growth by translating complex data into actionable insights and efficient solutions. Dedicated to fostering informed decisions through advanced analytics and clear data storytelling.

Pursuing a Master’s degree in Data Science and Analytics at Grand Valley State University was a significant shift in my career. I honed my technical abilities and learned to use data strategically. The program included projects such as analyzing gun violence and crime statistics in Boston using R & Google Colab. I also built several Tableau dashboards for analyzing Amazon sales and global Netflix viewership. This education helped me become proficient in transforming complex data into understandable and useful insights.

As I advance in my career, I am keen to specialize in both Data Analysis and Data Engineering. With a strong enthusiasm for data exploration and a knack for meticulous data handling, I’m well-prepared to tackle complex data challenges. My aim is to identify unique patterns and forecast trends to help organizations make informed decisions. Equipped with these skills, I am eager to contribute significantly to projects where accurate, data-driven insights drive business success.

Course Objective:

Exploring Data Variability in Models:

Statistical modeling relies heavily on distributions to estimate model parameters and assess uncertainty. In our work, we use these distributions to understand the variability in our data and to make informed assumptions about our linear regression model’s parameters. Specifically, we assume that the error terms are normally distributed with zero mean and constant variance, a standard assumption in linear regression analysis.
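As a small illustration of how these assumptions can be examined in R, the sketch below fits a simple linear model on the built-in mtcars data (a stand-in dataset, not part of this project) and inspects the residuals:

# Illustrative only: mtcars stands in for any dataset of interest
fit <- lm(mpg ~ wt + hp, data = mtcars)
# Q-Q plot of residuals: points close to the line support approximate normality
qqnorm(resid(fit))
qqline(resid(fit))
# Residuals vs. fitted values: a patternless band suggests constant variance
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)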

Using R for Statistical Modeling and Analysis

R is a powerful language and environment for statistical analysis and predictive modeling. In our project, we use R to build and evaluate different models, including linear regression. This setup lets us estimate important model quantities such as the regression coefficients. We also explore other models, such as logistic regression and time series analysis, using R’s specialized packages. Additionally, we use several R packages for organizing data, creating visualizations, and checking the models, which makes our work more effective and reliable.

Clear Presentation of Statistical Findings

I employ visual tools such as scatter plots, regression diagnostics, and summary tables to demonstrate relationships between variables and outline the core results of the analysis. I also interpret the regression coefficients and discuss the real-world implications of the findings, connecting them back to the research question. Conveying statistical results skillfully is crucial for making the information clear and useful to a broad audience, so in this project I use straightforward, concise language to share the main findings and insights, avoiding unnecessary technical jargon to make complex concepts more accessible.

About the Project

Purpose:

The primary goal of this project is to detect fraudulent transactions using a dataset of credit card transactions. The dataset contains features for each transaction, such as the amount and time, along with anonymized features (V1 through V28) that protect user privacy. The Class variable indicates whether a transaction is fraudulent (1) or legitimate (0).

Process:

Data Loading and Pre-processing:

Load the dataset from a CSV file. Perform initial explorations such as checking for missing values, understanding the structure of the data, and observing the distributions of key variables like transaction amounts.

Standardize the Amount variable to bring it to a similar scale as other features, which is important for many machine learning models.

Data Splitting:

Split the dataset into training and testing sets using a stratified sample based on the class label to ensure that both sets are representative of the overall dataset.

Model Development and Evaluation: Fit different models, including logistic regression, decision trees, and gradient boosting machines (GBM), to the training data.

Evaluate these models using the testing data, examining metrics like accuracy, precision, recall, and the confusion matrix.

Use plots (like decision tree plots, GBM performance, and logistic regression diagnostics) to visually assess model performance and diagnostics.

Model Tuning and Validation:

Predictions are made on both training and testing datasets to check for overfitting and to validate model performance.

The GBM model’s performance is fine-tuned by adjusting parameters like the number of trees, interaction depth, node size, learning rate, and bagging fraction.
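A rough sketch of how this tuning could be expressed with caret (loaded later in this report) is shown below; the grid values and 3-fold cross-validation are illustrative assumptions rather than the settings actually used, and the sketch presumes the train_data split created in the Data Modeling section. caret’s built-in gbm grid covers the number of trees, interaction depth, learning rate, and node size, while the bagging fraction is left at its default.

library(caret)
library(gbm)
# Recode the outcome as a factor so caret treats this as classification
train_tune <- train_data
train_tune$Class <- factor(train_tune$Class, labels = c("legit", "fraud"))
# Illustrative tuning grid (assumed values, not the settings used in this report)
tune_grid <- expand.grid(n.trees = c(100, 300, 500),
                         interaction.depth = c(1, 3),
                         shrinkage = 0.01,
                         n.minobsinnode = 100)
gbm_tuned <- train(Class ~ ., data = train_tune,
                   method = "gbm",
                   metric = "Kappa",
                   trControl = trainControl(method = "cv", number = 3),
                   tuneGrid = tune_grid,
                   verbose = FALSE)
gbm_tuned$bestTune   # parameter combination chosen by cross-validation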

Expected Outcomes:

The expected outcomes of this project would include:

Model Accuracy: Understanding how accurately each model can predict fraudulent transactions.

Insights into Data: Gaining insights into which features are most predictive of fraud, which can inform further feature engineering and data collection strategies.

Model Comparison: Comparing the performance of different models to choose the best performer for this specific task.

Operational Model: Developing a predictive model that could potentially be deployed in a real-world scenario to help detect and prevent credit card fraud.

# Loading packages

library(ranger)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(data.table)
library(caTools)
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

ranger: used for fitting random forest models.
caret: provides functions for training and plotting statistical models, including data splitting and pre-processing.
data.table: an extension of data.frame that provides enhancements in data manipulation and performance.
caTools: contains several basic utility functions, including functions to split data.
gbm: used for fitting generalized boosted regression models.

### Loading the Data
df <- read.csv("C:/Users/ravit/OneDrive/Desktop/Projects/creditcard.csv")

### Exploring the Data
head(df)

This part loads the dataset named “creditcard.csv” into a data frame df and displays the first few rows using head(df) to get an initial look at the data format and variables.

tail(df)

This part of the code displays the last few rows of the dataset.

# Check for missing values in the data

sum(is.na(df))
## [1] 0

# Exploring the data set

dim(df)
## [1] 284807     31
table(df$Class)
## 
##      0      1 
## 284315    492
summary(df$Amount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.60    22.00    88.35    77.17 25691.16
names(df)
##  [1] "Time"   "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"    
##  [9] "V8"     "V9"     "V10"    "V11"    "V12"    "V13"    "V14"    "V15"   
## [17] "V16"    "V17"    "V18"    "V19"    "V20"    "V21"    "V22"    "V23"   
## [25] "V24"    "V25"    "V26"    "V27"    "V28"    "Amount" "Class"
var(df$Amount)
## [1] 62560.07

Manipulating the Data

# Standardize Amount so it is on a scale similar to the other features
df$Amount <- scale(df$Amount)
# Drop the Time column (column 1), which is not used in the models
df_1 <- df[, -c(1)]
head(df_1)

Data Modeling

# Stratified 80/20 split on the class label so both sets keep the fraud ratio
set.seed(123)
split <- sample.split(df_1$Class, SplitRatio = 0.80)
train_data <- subset(df_1, split == TRUE)
test_data <- subset(df_1, split == FALSE)
dim(train_data)
## [1] 227846     30
dim(test_data)
## [1] 56961    30

Logistic Regression Model

# Logistic regression of Class on all other features (fit here on the test split)
logit_model <- glm(Class ~ ., data = test_data, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit_model)
## 
## Call:
## glm(formula = Class ~ ., family = binomial(), data = test_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -12.52800   10.30537  -1.216   0.2241  
## V1           -0.17299    1.27381  -0.136   0.8920  
## V2            1.44512    4.23062   0.342   0.7327  
## V3            0.17897    0.24058   0.744   0.4569  
## V4            3.13593    7.17768   0.437   0.6622  
## V5            1.49014    3.80369   0.392   0.6952  
## V6           -0.12428    0.22202  -0.560   0.5756  
## V7            1.40903    4.22644   0.333   0.7388  
## V8           -0.35254    0.17462  -2.019   0.0435 *
## V9            3.02176    8.67262   0.348   0.7275  
## V10          -2.89571    6.62383  -0.437   0.6620  
## V11          -0.09769    0.28270  -0.346   0.7297  
## V12           1.97992    6.56699   0.301   0.7630  
## V13          -0.71674    1.25649  -0.570   0.5684  
## V14           0.19316    3.28868   0.059   0.9532  
## V15           1.03868    2.89256   0.359   0.7195  
## V16          -2.98194    7.11391  -0.419   0.6751  
## V17          -1.81809    4.99764  -0.364   0.7160  
## V18           2.74772    8.13188   0.338   0.7354  
## V19          -1.63246    4.77228  -0.342   0.7323  
## V20          -0.69925    1.15114  -0.607   0.5436  
## V21          -0.45082    1.99182  -0.226   0.8209  
## V22          -1.40395    5.18980  -0.271   0.7868  
## V23           0.19026    0.61195   0.311   0.7559  
## V24          -0.12889    0.44701  -0.288   0.7731  
## V25          -0.57835    1.94988  -0.297   0.7668  
## V26           2.65938    9.34957   0.284   0.7761  
## V27          -0.45396    0.81502  -0.557   0.5775  
## V28          -0.06639    0.35730  -0.186   0.8526  
## Amount        0.22576    0.71892   0.314   0.7535  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1443.40  on 56960  degrees of freedom
## Residual deviance:  378.59  on 56931  degrees of freedom
## AIC: 438.59
## 
## Number of Fisher Scoring iterations: 17

Plotting the Results

# Standard glm diagnostic plots
plot(logit_model)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Applying the Model to the Training and Test Sets

# Predicted probabilities on the training split, thresholded at 0.5
lr_predict <- predict(logit_model, train_data, type = "response")
cm <- table(train_data[, 30], lr_predict > 0.5)
cm
##    
##      FALSE   TRUE
##   0 227406     46
##   1    116    278
# Repeat on the test split
lr_predict_test <- predict(logit_model, test_data, type = "response")
cm <- table(test_data[, 30], lr_predict_test > 0.5)
cm
##    
##     FALSE  TRUE
##   0 56860     3
##   1    32    66
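The Process section lists precision and recall alongside accuracy, but they are not computed explicitly in this report. A minimal sketch derives them from the test-set confusion matrix cm above, treating Class 1 (fraud) as the positive class:

# Rows of cm are the true class (0/1); columns are the predicted flag (FALSE/TRUE)
TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"]
FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"]
accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)   # of transactions flagged as fraud, how many really are
recall    <- TP / (TP + FN)   # of actual frauds, how many were flagged
c(accuracy = accuracy, precision = precision, recall = recall)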

### Plotting the decision tree

library(rpart)
library(rpart.plot)
# Fit a classification tree on the full dataset (used for the plot below)
decisionTree_model <- rpart(Class ~ . , df, method = 'class')
# Class predictions and class probabilities on the same data
predicted_val <- predict(decisionTree_model, df, type = 'class')
probability <- predict(decisionTree_model, df, type = 'prob')
rpart.plot(decisionTree_model)
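The tree above is fit and evaluated on the full dataset, which is convenient for the plot but gives an optimistic view of predictive performance. Below is a minimal sketch of how the same kind of tree could be assessed on the held-out split created earlier; it is an illustrative alternative, not what is plotted above:

# Sketch: fit the tree on the training split only and score the test split
tree_fit <- rpart(Class ~ ., data = train_data, method = "class")
tree_pred <- predict(tree_fit, test_data, type = "class")
table(actual = test_data$Class, predicted = tree_pred)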

# Train a GBM and time the fit. train.fraction uses the first block of rows
# (train_data) for fitting and holds out the remaining rows (test_data) for the
# out-of-sample performance curve used by gbm.perf() below.
system.time(
       model_gbm <- gbm(Class ~ .
               , distribution = "bernoulli"
               , data = rbind(train_data, test_data)
               , n.trees = 500
               , interaction.depth = 3
               , n.minobsinnode = 100
               , shrinkage = 0.01
               , bag.fraction = 0.5
               , train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
)
)
##    user  system elapsed 
##   93.25    0.44  265.69
# Number of boosting iterations that minimizes the held-out (test) error
gbm.iter <- gbm.perf(model_gbm, method = "test")

# Partial-dependence plot for the first predictor in the boosted model
plot(model_gbm)
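gbm.iter above stores the number of boosting iterations suggested by gbm.perf(), but it is not used further in this report. A minimal sketch of how it could be used to score the held-out split follows; the 0.5 cutoff is an illustrative choice, not a tuned threshold:

# Sketch: probabilities for the test split using the iteration count from gbm.perf()
gbm_test_prob <- predict(model_gbm, newdata = test_data,
                         n.trees = gbm.iter, type = "response")
# Confusion table at an illustrative 0.5 cutoff
table(actual = test_data$Class, predicted = gbm_test_prob > 0.5)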

Conclusion:

Model Diversity: The use of multiple modeling techniques, including logistic regression, decision trees, and gradient boosting machines (GBM), is a strong strategy. This approach allows for comparing different models based on their performance metrics, such as accuracy and recall, which are crucial for imbalanced datasets like those typically found in fraud detection.

Data Handling: The careful preprocessing and manipulation of the data, such as scaling the Amount feature and removing irrelevant features, help in normalizing the dataset and potentially improve model performance. Additionally, the strategic split of data into training and testing sets ensures that the models are evaluated in a robust manner, simulating real-world predictions and validating the model’s ability to generalize.

Visualizations and Diagnostics: Plotting the results and diagnostics, such as the decision tree visualization and logistic regression diagnostics, aids in interpreting the models’ decisions and understanding their behavior. This is particularly useful for stakeholders who may need insights into how decisions are being made or for identifying areas where the model may be improved.

Scalability and Real-World Application: The project lays a foundation for developing a scalable fraud detection system that can be integrated into transaction processing pipelines to flag fraudulent activities in real time. However, it is important to continuously update and retrain models as new types of fraud emerge and as transaction patterns change over time.

Reflection on the course work:

During this course, I’ve used statistics to study real data, turning complex ideas into something practical. My main project looked at how trees in cities affect temperatures, blending environmental concerns with data analysis. This experience sharpened my ability to understand data and reinforced the importance of making decisions based on evidence and using data to learn new things.

I’ve been exploring how rooftop gardens change temperatures. It’s been a lot of looking at numbers and figuring out patterns. This has helped me understand how cities and nature can work together better and solve some of our environmental problems. It’s like finding ways to build cities that are friendlier to the environment, and using data to make those decisions smarter.

Reflection on my participation:

Looking back on my time in this course, it feels like a shared adventure guided by our professor rather than just a learning journey. My role stretched beyond the classroom, allowing me to witness theories come to life in practical ways.

Throughout the course, my interactions with statistical models and regression analyses were like lively conversations rather than just class exercises. Our discussions went beyond simple Q&A sessions, diving deep into the complexities of the material we were studying. With our professor’s help, these concepts became clearer, especially when we applied them to real-world data. This sparked my curiosity even more, driving me to explore further.

Talking and working with my classmates was a big part of this course for me. We didn’t just do projects and help each other because we had to. It was about sharing our ideas to make our work better. When we all got together to talk about our ideas, it showed me how much we could do when we worked as a team. Bringing in what I knew about cloud technology gave us new ways to think about our projects, mixing my experience with what we were learning in class.

I tackled my studies head-on, diving into tools like Tableau, Python, and R programming with gusto. Each assignment was a chance to get better, whether I was looking at trends in the market or understanding how people shop. The tough work pushed me to do my best, and you can see it in how carefully I did my projects and how much detail I put into my analyses.

Overall, my experience in the course was marked by active participation, a curious mind, and collaboration. I crafted this chapter of my academic journey with dedication and a genuine appreciation for the field of statistics, and it’s a chapter I’ll cherish warmly in my memories.

Future Work:

Potential avenues for future exploration include experimenting with advanced analytics techniques, such as neural networks and ensemble methods, to enhance risk assessment and fraud detection in financial transactions. Incorporating additional variables, such as customer behavior patterns and macroeconomic indicators, could further enrich the model’s explanatory power and predictive accuracy. Ultimately, this study contributes valuable insights into financial risk management and lays the groundwork for further innovation and refinement in banking and financial analysis.