The data set consists of 401,146 observations. Each observation has six attributes. These are:
* X: transaction identification number
* ID: salesperson identification number
* Val: sales dollar amount
* Prod: product id
* Quant: quantity
* Insp: fraud inspection status

The goal of this document is to predict fraudulent sales so that we can avoid them. Of the 401,146 sales in the dataset, 1,270 are fraudulent and 14,462 are clean. The remaining 385,414 have an unknown status.
We will do the following:
1. We’ll temporarily remove the observations whose Insp status is unknown.
2. We’ll partition the known observations into a training set and a test set.
3. We’ll run the k-nearest neighbors (kNN) classification algorithm to predict whether a sale was fraudulent based on the variables Val, Prod, and Quant.

Libraries used for this project:
* dplyr: for data wrangling
* caret: for partitioning the data into training and test sets
* class: for carrying out the k-nearest neighbors algorithm
* gmodels: for the CrossTable() cross-tabulation of predicted and actual labels

suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(class))
suppressMessages(library(gmodels))

Data Preparation

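The code below reads the data, drops incomplete rows, and keeps only the sales with a known inspection status. The comments explain each step.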

# reads in the csv file
sales <- read.csv("~/Documents/datasets/Sales.csv")
# filters out the observations with NAs in Quant or Val
sales <- sales %>%
  filter(!is.na(Quant) & !is.na(Val))

# creates a new data frame after filtering out the observations with unknown status in Insp
sales_fk <- sales %>% filter(Insp != "unkn")
# removes p's from the Prod column
sales_fk$Prod <- gsub("p", "", sales_fk$Prod) %>% as.double()

# Quick peek at the data
head(sales_fk)
##    X  ID Prod Quant    Val Insp
## 1 53 v42   11 51097 310780   ok
## 2 56 v45   11   260   1925   ok
## 3 68 v42   11 51282 278770   ok
## 4 77 v50   11 46903 281485   ok
## 5 82 v46   12   475   2600   ok
## 6 84 v48   12   433   3395   ok
table(sales_fk$Insp)
## 
## fraud    ok  unkn 
##  1199 14347     0
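
The counts above give the class balance among the inspected sales. As a minimal sketch of that arithmetic (the counts are copied by hand from the table output, and the variable names `counts`, `class_share`, and `baseline_accuracy` are introduced here only for illustration):

```r
# Counts copied from the table(sales_fk$Insp) output above
counts <- c(fraud = 1199, ok = 14347)

# Share of each class among the inspected sales
class_share <- counts / sum(counts)
round(class_share, 4)

# Accuracy of a naive rule that labels every sale "ok"
baseline_accuracy <- unname(class_share["ok"])
baseline_accuracy
```

A classifier is only useful here if it does noticeably better than this naive rule, especially on the fraud class.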

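Notice that of the sales with known inspection status, about 7.7% (1,199 of 15,546) are fraudulent. A model that labeled every sale as ok would therefore already be about 92.3% accurate, and our models should improve on this baseline. The partition code follows.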

# creates a training set and a testing set
set.seed(3456)
trainIndex <- createDataPartition(sales_fk$Insp, p = .8,
                                  list = FALSE,
                                  times = 1)
## Warning in createDataPartition(sales_fk$Insp, p = 0.8, list = FALSE, times
## = 1): Some classes have no records ( unkn ) and these will be ignored
Train <- sales_fk[ trainIndex,]
Test  <- sales_fk[-trainIndex,]

# Creates training and testing sets with just the variables of interest
Train.set <- Train %>% select(Prod, Quant, Val)
Test.set <- Test %>% select(Prod, Quant, Val)

# creates vectors with just the labels for both training and testing sets
Train.lab <- Train$Insp
Test.lab <- Test$Insp

The k-nearest neighbor algorithm classifies each observation in the test set by finding the k closest observations in the training set and taking a majority vote of their labels. In the call to knn() we use k = 2, so each prediction is based on the two nearest training observations; k is the number of neighbors consulted, not the number of classes. We then validate the predictions against the labels of the test set and compute the accuracy. In a situation like this, overall accuracy is important, but it’s also important to know where we’re making mistakes: letting a fraud go unnoticed is more costly than labeling a clean sale as fraudulent. Our current kNN model is 94% accurate, only slightly above the 92.3% we would get by labeling every test sale as ok. It correctly flagged 131 of the 239 fraudulent sales in the test set, meaning 108 fraudulent sales went unnoticed. I will look into improvements for a dataset of this sort.
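
To make the role of k concrete, here is a small sketch on made-up data (the objects `toy_train`, `toy_labels`, and `toy_test` are hypothetical and unrelated to the sales data). It uses the same knn() function from the class package and shows how the prediction for a single point can change with the number of neighbors consulted.

```r
library(class)

# Hypothetical toy data: one numeric feature, three labelled training points
toy_train  <- data.frame(x = c(1, 5, 6))
toy_labels <- factor(c("fraud", "ok", "ok"))
toy_test   <- data.frame(x = 2)

# k = 1: only the closest point (x = 1, labelled "fraud") votes
pred_k1 <- knn(train = toy_train, test = toy_test, cl = toy_labels, k = 1)

# k = 3: all three points vote, and "ok" wins two to one
pred_k3 <- knn(train = toy_train, test = toy_test, cl = toy_labels, k = 3)

as.character(pred_k1)
as.character(pred_k3)
```

Note that with an even k and two classes, the two neighbors can disagree; knn() breaks such ties at random, which is one reason an odd k is often preferred for binary problems.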

fraud_pred <- knn(train = Train.set, test = Test.set, 
                 cl = Train.lab, k=2)

# Validates the prediction labels with the Test labels
CrossTable(x = Test.lab, y = fraud_pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3108 
## 
##  
##              | fraud_pred 
##     Test.lab |     fraud |        ok | Row Total | 
## -------------|-----------|-----------|-----------|
##        fraud |       131 |       108 |       239 | 
##              |     0.548 |     0.452 |     0.077 | 
##              |     0.642 |     0.037 |           | 
##              |     0.042 |     0.035 |           | 
## -------------|-----------|-----------|-----------|
##           ok |        73 |      2796 |      2869 | 
##              |     0.025 |     0.975 |     0.923 | 
##              |     0.358 |     0.963 |           | 
##              |     0.023 |     0.900 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       204 |      2904 |      3108 | 
##              |     0.066 |     0.934 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
# calculates accuracy as the share of predictions that match the test labels
mean(fraud_pred == Test.lab)
## [1] 0.9417632
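
Because missed frauds matter more than false alarms, it can help to look beyond accuracy. The sketch below recomputes accuracy and adds recall and precision for the fraud class, using counts copied by hand from the CrossTable output above (the names `tp`, `fn`, `fp`, and `tn` are introduced here for illustration).

```r
# Counts copied from the CrossTable output above
tp <- 131   # fraud predicted as fraud
fn <- 108   # fraud predicted as ok
fp <- 73    # ok predicted as fraud
tn <- 2796  # ok predicted as ok

accuracy  <- (tp + tn) / (tp + fn + fp + tn)
recall    <- tp / (tp + fn)   # share of actual frauds that were caught
precision <- tp / (tp + fp)   # share of fraud predictions that were correct

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
```

The recall and precision values correspond to the 0.548 row proportion and 0.642 column proportion already shown in the fraud/fraud cell of the table.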