The data set consists of 401,146 observations, each with six attributes:
* X: transaction identification number
* ID: salesperson identification number
* Val: sales dollar amount
* Prod: product id
* Quant: quantity
* Insp: fraud inspection status
The goal of this document is to predict fraudulent sales so that we can avoid them. Of the 401,146 sales in the data set, 1,270 are fraudulent and 14,462 are clean. The remaining 385,414 have an unknown status.
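As a quick check on these figures, the short sketch below recomputes the total and the share of each status from the three counts quoted above. The variable names (`counts`, `total`, `shares`) are only illustrative and are not part of the analysis code.

```r
# Counts quoted above for each inspection status
counts <- c(fraud = 1270, ok = 14462, unkn = 385414)

# Sanity check: the three groups should add up to the full data set
total <- sum(counts)
total

# Share of each status, in percent
shares <- round(100 * counts / total, 2)
shares
```

Roughly 96% of the sales have never been inspected, which is why the analysis works only with the small labeled subset.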
We will do the following:
1. We’ll temporarily remove the observations whose Insp status is unknown.
2. We’ll partition the remaining observations into a training set and a test set.
3. We’ll run the k-nearest neighbors (kNN) classification algorithm to predict whether a sale was fraudulent, based on the variables Val, Prod, and Quant.
Libraries used for this project:
* dplyr: for data wrangling
* caret: for partitioning the data into training and test sets
* class: for carrying out the k-nearest neighbors algorithm
* gmodels: for cross-tabulating predicted and actual labels with CrossTable
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(class))
suppressMessages(library(gmodels))
The code below, with comments, reads and cleans the data.
# reads in the csv file
sales <- read.csv("~/Documents/datasets/Sales.csv")
# drops the observations with NAs in Quant or Val
sales <- sales %>%
  filter(!is.na(Quant) & !is.na(Val))
# creates a new data frame without the observations whose Insp status is unknown
sales_fk <- sales %>% filter(Insp != "unkn")
# removes the letter p from the Prod values and converts them to numbers
sales_fk$Prod <- gsub("p", "", sales_fk$Prod) %>% as.double()
# quick peek at the data
head(sales_fk)
## X ID Prod Quant Val Insp
## 1 53 v42 11 51097 310780 ok
## 2 56 v45 11 260 1925 ok
## 3 68 v42 11 51282 278770 ok
## 4 77 v50 11 46903 281485 ok
## 5 82 v46 12 475 2600 ok
## 6 84 v48 12 433 3395 ok
table(sales_fk$Insp)
##
## fraud ok unkn
## 1199 14347 0
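Notice that about 7.7% of the sales with a known inspection status are fraudulent (1,199 of 15,546). A classifier that labels every sale as clean would therefore already be about 92.3% accurate, so our model should beat that baseline. The partition into training and test sets is created with createDataPartition from caret.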
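The fraud share and the accuracy of an "always clean" classifier can be recomputed from the table counts. This is a minimal sketch; the names `known`, `fraud_share`, and `baseline_acc` are introduced here for illustration only.

```r
# Label counts among inspected sales, copied from the table above
known <- c(fraud = 1199, ok = 14347)

# Fraction of inspected sales that are fraudulent
fraud_share <- known[["fraud"]] / sum(known)
round(fraud_share, 4)

# Accuracy of a trivial classifier that labels every sale "ok"
baseline_acc <- known[["ok"]] / sum(known)
round(baseline_acc, 4)
```

Any model we fit is only useful if it does better than this baseline while also catching a meaningful share of the fraud.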
# creates a training set and a testing set
set.seed(3456)
trainIndex <- createDataPartition(sales_fk$Insp, p = .8,
                                  list = FALSE,
                                  times = 1)
## Warning in createDataPartition(sales_fk$Insp, p = 0.8, list = FALSE, times
## = 1): Some classes have no records ( unkn ) and these will be ignored
Train <- sales_fk[ trainIndex,]
Test <- sales_fk[-trainIndex,]
# Creates training and testing sets with just the variables of interest
Train.set <- Train %>% select(Prod, Quant, Val)
Test.set <- Test %>% select(Prod, Quant, Val)
# creates vectors with just the labels for both training and testing sets
Train.lab <- Train$Insp
Test.lab <- Test$Insp
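When given a factor, createDataPartition samples within each class, so the training and test sets keep roughly the same fraud share as the full labeled data. The sketch below illustrates that idea in base R on made-up labels. It is a simplified illustration of stratified sampling, not caret's actual implementation, and every name in it is hypothetical.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Made-up labels: 20 fraud and 180 ok, i.e. 10% fraud
labels <- factor(rep(c("fraud", "ok"), times = c(20, 180)))

# Row indices grouped by class
idx_by_class <- split(seq_along(labels), labels)

# Draw 80% of the rows within each class, then combine them
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.8 * length(i)))
}))

# Both subsets keep the original 10% fraud share
table(labels[train_idx])
table(labels[-train_idx])
```

With a plain random split, a rare class such as fraud could end up over- or under-represented in the test set by chance; sampling per class avoids that.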
The k-nearest neighbors algorithm does not fit an explicit model. To classify a test observation, it finds the k training observations closest to it (by Euclidean distance on Prod, Quant, and Val) and assigns the majority label among them. The knn() call on the sales data uses k = 2, so each prediction is a vote between the two nearest neighbors, and knn() breaks ties at random. Finally, we compare the predictions with the labels of the test set and compute the accuracy.
In a situation like this, overall accuracy is important, but it’s also important to know where we’re making mistakes: letting a fraud go unnoticed is more severe than labeling a clean sale as fraudulent. Our current kNN model is about 94% accurate, only slightly better than the roughly 92% obtained by labeling every test sale as clean. It caught 131 of the 239 fraudulent sales in the test set, meaning 108 fraudulent sales went unnoticed. I will look into improvements for a data set of this sort.
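To make the neighbor vote concrete, here is a small self-contained sketch on made-up points, separate from the sales analysis. The data, the object names, and the choice of k = 3 (an odd value, to avoid tied votes) are all assumptions for illustration; only knn() from the class package is taken from the analysis.

```r
library(class)

# Made-up training points: two features, two well-separated groups
train_x <- data.frame(Quant = c(1, 2, 3, 20, 21, 22),
                      Val   = c(1, 2, 1, 20, 22, 21))
train_y <- factor(c("ok", "ok", "ok", "fraud", "fraud", "fraud"))

# Two new points to classify, one near each group
test_x <- data.frame(Quant = c(2, 21), Val = c(2, 21))

# k is the number of neighbors consulted for each test point,
# not the number of classes
toy_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
toy_pred
```

The first test point sits among the "ok" points and the second among the "fraud" points, so each one takes the label of the three training points nearest to it.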
fraud_pred <- knn(train = Train.set, test = Test.set,
                  cl = Train.lab, k = 2)
# Validates the prediction labels with the Test labels
CrossTable(x = Test.lab, y = fraud_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3108
##
##
## | fraud_pred
## Test.lab | fraud | ok | Row Total |
## -------------|-----------|-----------|-----------|
## fraud | 131 | 108 | 239 |
## | 0.548 | 0.452 | 0.077 |
## | 0.642 | 0.037 | |
## | 0.042 | 0.035 | |
## -------------|-----------|-----------|-----------|
## ok | 73 | 2796 | 2869 |
## | 0.025 | 0.975 | 0.923 |
## | 0.358 | 0.963 | |
## | 0.023 | 0.900 | |
## -------------|-----------|-----------|-----------|
## Column Total | 204 | 2904 | 3108 |
## | 0.066 | 0.934 | |
## -------------|-----------|-----------|-----------|
##
##
# calculates accuracy: the share of test labels predicted correctly
mean(fraud_pred == Test.lab)
## [1] 0.9417632
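Because missed fraud matters more than false alarms here, it helps to look at recall and precision for the fraud class alongside accuracy. The sketch below derives them from the four cell counts in the CrossTable output; the variable names are illustrative.

```r
# Cell counts copied from the CrossTable output above
tp <- 131   # fraud predicted as fraud
fn <- 108   # fraud predicted as ok (missed fraud)
fp <- 73    # ok predicted as fraud (false alarm)
tn <- 2796  # ok predicted as ok

accuracy  <- (tp + tn) / (tp + fn + fp + tn)
recall    <- tp / (tp + fn)   # share of actual fraud that was caught
precision <- tp / (tp + fp)   # share of fraud predictions that were correct

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
```

The accuracy matches the value computed above, while the recall shows that only a little over half of the fraudulent test sales were flagged.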