Description

This report describes banknote authentication prediction using classification models. The dataset used in this project is the banknote authentication data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication

Data were extracted from images taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, grayscale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.

The dataset contains a total of 1372 records of different banknotes. The four leftmost columns are measurements that we can use to predict whether a note is genuine or counterfeit; the fifth column is the label, provided by a human, coded as 0 and 1, where 0 represents a real and 1 a fake banknote. Machine learning algorithms require the features and the label to be separated from each other, where the label is the output class or category. In our dataset, variance, skewness, curtosis, and entropy are the features, whereas the class column contains the label.

Report Outline:

  1. Data Extraction
  2. Exploratory Data Analysis
  3. Data Preparation
  4. Modelling
  5. Evaluation
  6. Results and Recommendations

1. Data Extraction

Extract Data

Read the data at the URL below into a data frame in R.

loc <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/"
ds <- "data_banknote_authentication.txt"
url <- paste(loc, ds, sep ="")

bn_df <- read.table(url, sep = ",")

Change Variable Names

Change the variable names to variance, skewness, curtosis, entropy, and class.

names(bn_df) <- c("variance", "skewness", "curtosis", "entropy", "class")

Structure of Data

See the data structure of the data frame. There are 1372 observations and 5 variables.

str(bn_df)
## 'data.frame':    1372 obs. of  5 variables:
##  $ variance: num  3.622 4.546 3.866 3.457 0.329 ...
##  $ skewness: num  8.67 8.17 -2.64 9.52 -4.46 ...
##  $ curtosis: num  -2.81 -2.46 1.92 -4.01 4.57 ...
##  $ entropy : num  -0.447 -1.462 0.106 -3.594 -0.989 ...
##  $ class   : int  0 0 0 0 0 0 0 0 0 0 ...

It is evident from the output that our dataset has four features: variance, skewness, curtosis, and entropy, while class indicates whether the banknote is real or fake.

The variables are:
1. Variance measures how each pixel varies from its neighboring pixels, and is used to classify pixels into different regions
2. Skewness is a measure of the lack of symmetry
3. Curtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
4. Entropy is a quantity used to describe the amount of information that must be coded for by a compression algorithm
5. Class contains two values: 0 representing a real note and 1 representing a fake note
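As noted earlier, machine learning algorithms need the features and the label handled separately. In R this is simply column selection; an illustrative sketch (not strictly needed later, since the formula interface class ~ . performs the split implicitly):

# features: the four numeric measurements
features <- bn_df[, c("variance", "skewness", "curtosis", "entropy")]
# label: the output class
label <- bn_df$class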

Summary of Data

To see the statistical details of the data, the summary() function can be used; it returns the minimum, first quartile, median, mean, third quartile, and maximum values for each column.

summary(bn_df)
##     variance          skewness          curtosis          entropy       
##  Min.   :-7.0421   Min.   :-13.773   Min.   :-5.2861   Min.   :-8.5482  
##  1st Qu.:-1.7730   1st Qu.: -1.708   1st Qu.:-1.5750   1st Qu.:-2.4135  
##  Median : 0.4962   Median :  2.320   Median : 0.6166   Median :-0.5867  
##  Mean   : 0.4337   Mean   :  1.922   Mean   : 1.3976   Mean   :-1.1917  
##  3rd Qu.: 2.8215   3rd Qu.:  6.815   3rd Qu.: 3.1793   3rd Qu.: 0.3948  
##  Max.   : 6.8248   Max.   : 12.952   Max.   :17.9274   Max.   : 2.4495  
##      class       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4446  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

We can see that the variables vary over different ranges and have different means, so some normalization or standardization will likely be required prior to modeling.

Change Value of Class

Convert the class variable to a factor, where 0 is real and 1 is fake.

bn_df$class <- factor(bn_df$class,
                          levels = c(0,1),
                          labels = c("real","fake"))

Check for Missing Value

The data has no missing values:

colSums(is.na(bn_df))
## variance skewness curtosis  entropy    class 
##        0        0        0        0        0

Check for Proportion of Target Variable

prop.table(table(bn_df$class))
## 
##      real      fake 
## 0.5553936 0.4446064

There are 762 real banknotes (55.5%) and 610 fake banknotes (44.5%).
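The raw counts behind these proportions can be checked with table():

table(bn_df$class)
## 
## real fake 
##  762  610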

2. Exploratory Data Analysis

2.1 Univariate Analysis

Distribution Plot

A histogram plot is then created for each variable.

library(ggplot2)   # histograms and density plots
library(gridExtra) # grid.arrange()

#variance
h1 <- ggplot(bn_df, aes(x=variance)) + 
  geom_histogram(aes(y=..density..),
                 binwidth=.5,
                 colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")
#skewness
h2 <- ggplot(bn_df, aes(x=skewness)) + 
  geom_histogram(aes(y=..density..),
                 binwidth=.5,
                 colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")  
#curtosis
h3 <- ggplot(bn_df, aes(x=curtosis)) + 
  geom_histogram(aes(y=..density..),
                 binwidth=.5,
                 colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")  
#entropy
h4 <- ggplot(bn_df, aes(x=entropy)) + 
  geom_histogram(aes(y=..density..),
                 binwidth=.5,
                 colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")  # Overlay with transparent density plot

grid.arrange(h1, h2 , h3, h4, ncol = 2)

We can see that the first two variables have an approximately Gaussian distribution, while the next two input variables appear to have a skewed Gaussian or an exponential distribution. There may be outliers in the entropy and curtosis variables, and the data are not normalized. To be sure, we will look at boxplots.

Boxplot

p21 <- ggplot(bn_df, aes(x = class, y = variance)) +
  geom_boxplot()

p22 <- ggplot(bn_df, aes(x = class, y = skewness)) +
  geom_boxplot()

p23 <- ggplot(bn_df, aes(x = class, y = curtosis)) +
  geom_boxplot()

p24 <- ggplot(bn_df, aes(x = class, y = entropy)) +
  geom_boxplot()

grid.arrange(p21, p22 , p23, p24, ncol = 2)

The entropy and curtosis variables have outliers; these should be handled in the data preparation step. As the earlier proportion check showed, the target variable (class) is reasonably balanced.

2.2 Bivariate Analysis

p1 <- ggplot(bn_df, aes(x = variance, y = skewness, color = class)) + 
  geom_point()

p2 <- ggplot(bn_df, aes(x = variance, y = curtosis, color = class)) + 
  geom_point()

p3 <- ggplot(bn_df, aes(x = skewness, y = curtosis, color = class)) + 
  geom_point()

p4 <- ggplot(bn_df, aes(x = variance, y = entropy, color = class)) + 
  geom_point()

p5 <- ggplot(bn_df, aes(x = skewness, y = entropy, color = class)) + 
  geom_point()

p6 <- ggplot(bn_df, aes(x = curtosis, y = entropy, color = class)) + 
  geom_point()

grid.arrange(p1, p2 , p3, p4, p5, p6, ncol = 2)

It is visible from the output that entropy and variance have a slight positive linear correlation, while there is a strong inverse linear correlation between curtosis and skewness. Finally, we can see that the values for curtosis and entropy are slightly higher for real banknotes, while the values for skewness and variance are higher for fake banknotes.

2.3 Multivariate Analysis

Compute and visualize Pearson's correlation coefficients between the four numeric features.

cor(bn_df[1:4])
##            variance   skewness   curtosis    entropy
## variance  1.0000000  0.2640255 -0.3808500  0.2768167
## skewness  0.2640255  1.0000000 -0.7868952 -0.5263208
## curtosis -0.3808500 -0.7868952  1.0000000  0.3188409
## entropy   0.2768167 -0.5263208  0.3188409  1.0000000
library(PerformanceAnalytics) # chart.Correlation()
chart.Correlation(bn_df[1:4], histogram=TRUE, pch=19)

From the pair plots of the attributes, we infer the following:

  1. The variance of both classes is approximately normally distributed.
  2. The skewness of both classes appears right-skewed.
  3. The curtosis of class 0 is approximately normally distributed, while that of class 1 is right-skewed with high-value outliers.
  4. The entropy of both classes follows a left-skewed distribution, indicating the existence of low-valued outliers.

3. Data Preparation

3.1 Data Cleaning

Normalization

The variables have different value ranges, and some start from negative values; we can handle this with min-max normalization so that the value of each variable lies between 0 and 1.

normalize <- function(x)
{
  return( (x-min(x)) / (max(x)-min(x)))
}

class <- bn_df$class
bn_df <- as.data.frame(lapply(bn_df[1:4], normalize))
bn_df <- cbind(bn_df, class)
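As a quick sanity check, normalize() maps any numeric vector onto the range [0, 1] (illustrative values):

normalize(c(2, 4, 6, 8))
## [1] 0.0000000 0.3333333 0.6666667 1.0000000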

summary(bn_df)
##     variance         skewness         curtosis         entropy        class    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   real:762  
##  1st Qu.:0.3800   1st Qu.:0.4515   1st Qu.:0.1599   1st Qu.:0.5578   fake:610  
##  Median :0.5436   Median :0.6022   Median :0.2543   Median :0.7239             
##  Mean   :0.5391   Mean   :0.5873   Mean   :0.2879   Mean   :0.6689             
##  3rd Qu.:0.7113   3rd Qu.:0.7704   3rd Qu.:0.3647   3rd Qu.:0.8132             
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Remove Outliers

Removing the outliers from the curtosis and entropy variables, identified with the boxplot rule (boxplot.stats()).

# get outliers
out_curtosis <- boxplot.stats(bn_df$curtosis)$out
out_entropy <- boxplot.stats(bn_df$entropy)$out

# get row indices of the outlier values
idx_out_curtosis <- which(bn_df$curtosis %in% out_curtosis)
idx_out_entropy <- which(bn_df$entropy %in% out_entropy)

# remove all outlier rows in a single step; removing them in two
# separate steps would shift the row positions after the first removal
# and drop the wrong rows the second time
bn_df <- bn_df[-unique(c(idx_out_curtosis, idx_out_entropy)), ]
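Since rows are dropped here, it is worth checking how many observations remain before modelling (the exact count is not reported here):

# number of remaining observations after outlier removal
nrow(bn_df)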

p21 <- ggplot(bn_df, aes(x = class, y = variance)) +
  geom_boxplot()

p22 <- ggplot(bn_df, aes(x = class, y = skewness)) +
  geom_boxplot()

p23 <- ggplot(bn_df, aes(x = class, y = curtosis)) +
  geom_boxplot()

p24 <- ggplot(bn_df, aes(x = class, y = entropy)) +
  geom_boxplot()

grid.arrange(p21, p22 , p23, p24, ncol = 2)

The boxplots confirm that the outliers in the curtosis and entropy variables have been successfully removed.

3.2 Training and Testing Division

Dividing the data into training and test sets: the training set is used to train the machine learning algorithms, while the test set is used to evaluate their performance. We use a test size of 40%, so the test set contains 40% of the data and the remaining 60% is used for building the models.

m <- nrow(bn_df) 
set.seed(2301)

train_idx <- sample(m, 0.6*m)
train_df <- bn_df[train_idx, ]
test_df <- bn_df[-train_idx, ]  
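Note that sample() draws rows without regard to class, so the class ratio in the two sets may drift slightly. If a stratified split is preferred, the createDataPartition() function from the caret package can be used instead; a minimal sketch, assuming caret is installed:

library(caret) # createDataPartition()

set.seed(2301)
strat_idx <- createDataPartition(bn_df$class, p = 0.6, list = FALSE)
strat_train <- bn_df[strat_idx, ]
strat_test  <- bn_df[-strat_idx, ]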

4. Modelling

We trained two classification algorithms to predict banknote authentication on the prepared data: Random Forest and Support Vector Machine.

Random Forest

For this model, we built six scenarios, varying the number of trees (ntree) and the number of variables randomly sampled as split candidates at each node (mtry). The goal of the scenarios is to determine the best parameters for the Random Forest algorithm by comparing the accuracy of each scenario. (A more compact way to generate all six fits is sketched after the code below.)

## Random Forest
library(randomForest) # randomForest(), na.roughfix()

set.seed(2310)
fit.forest204 <- randomForest(formula = class ~ .,
                             data = train_df,
                             na.action = na.roughfix,
                             importance = TRUE,
                             ntree = 20,
                             mtry = 4)

set.seed(2310)
fit.forest404 <- randomForest(formula = class ~ .,
                             data = train_df,
                             na.action = na.roughfix,
                             importance = TRUE,
                             ntree = 40,
                             mtry = 4)

set.seed(2310)
fit.forest604 <- randomForest(formula = class ~ .,
                             data = train_df,
                             na.action = na.roughfix,
                             importance = TRUE,
                             ntree = 60,
                             mtry = 4)

set.seed(2310)
fit.forest203 <- randomForest(formula = class ~ .,
                              data = train_df,
                              na.action = na.roughfix,
                              importance = TRUE,
                              ntree = 20,
                              mtry = 3)

set.seed(2310)
fit.forest403 <- randomForest(formula = class ~ .,
                              data = train_df,
                              na.action = na.roughfix,
                              importance = TRUE,
                              ntree = 40,
                              mtry = 3)

set.seed(2310)
fit.forest603 <- randomForest(formula = class ~ .,
                              data = train_df,
                              na.action = na.roughfix,
                              importance = TRUE,
                              ntree = 60,
                              mtry = 3)
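As mentioned above, the six fits can also be generated from a parameter grid with a loop; a sketch that reproduces the same models:

# all (ntree, mtry) combinations used in the six scenarios
grid <- expand.grid(ntree = c(20, 40, 60), mtry = c(4, 3))

forests <- lapply(seq_len(nrow(grid)), function(i) {
  set.seed(2310) # same seed as the individual fits above
  randomForest(class ~ .,
               data = train_df,
               na.action = na.roughfix,
               importance = TRUE,
               ntree = grid$ntree[i],
               mtry = grid$mtry[i])
})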

Support Vector Machine

## Support Vector Machine
library(e1071) # svm()

fit.svm <- svm(formula = class ~ .,
               data = train_df,
               kernel = "radial") # "radial" is the e1071 default

5. Evaluation

To evaluate our models, we use accuracy, precision, recall and F1 score.
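The evaluation code is not shown in this report; below is a minimal sketch of how these metrics can be computed from a confusion matrix, assuming the fake class is treated as positive (the report does not state which class was used):

# compute precision, recall, accuracy and F1 from model predictions
eval_metrics <- function(model, data) {
  pred <- predict(model, data)
  cm <- table(Actual = data$class, Predicted = pred)
  tp <- cm["fake", "fake"]; tn <- cm["real", "real"]
  fp <- cm["real", "fake"]; fn <- cm["fake", "real"]
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  accuracy  <- (tp + tn) / sum(cm)
  f1        <- 2 * precision * recall / (precision + recall)
  round(c(Precision = precision, Recall = recall,
          Accuracy = accuracy, F1 = f1), 3)
}

# e.g. eval_metrics(fit.forest203, test_df)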

Random Forest

After building the models, we evaluated the six scenarios; the results are as follows.

mtry   ntree   Precision   Recall   Accuracy   F1 Score
   3      20       0.995    0.995      0.996      0.995
   3      40       0.991    0.991      0.992      0.991
   3      60       0.995    0.991      0.994      0.993
   4      20       0.977    0.986      0.984      0.982
   4      40       0.986    0.986      0.988      0.986
   4      60       0.986    0.986      0.988      0.986

From the evaluation results in the table above, mtry has little effect on the outcome; the main exception is that precision drops when using mtry = 4 with ntree = 20. In general, the evaluation results are similar across scenarios, and the best scenario achieves precision 99.5%, recall 99.5%, accuracy 99.6% and F1 score 99.5%.

##       Predicted
## Actual real fake
##   real  296    1
##   fake    1  215
## == Random Forest nTrees 20 mtry 3 == 
## Precision =  0.995 
## Recall =  0.995 
## Accuracy =  0.996 
## F1 Score =  0.995

To optimize computation time, we therefore select mtry = 3 and ntree = 20 as the best parameters among the scenarios.

Support Vector Machine

In this SVM model, we use the radial kernel (the e1071 default) and obtain the following result.

##       Predicted
## Actual real fake
##   real  297    0
##   fake    0  216
## == Support Vector Machine == 
## Precision =  1 
## Recall =  1 
## Accuracy =  1 
## F1 Score =  1

The result of this algorithm is excellent: it classifies fake and genuine banknotes in the test set with precision 100%, recall 100%, accuracy 100% and F1 score 100%.

6. Results and Recommendations

  1. Two variables, curtosis and entropy, have outliers.
  2. The values for curtosis and entropy are slightly higher for real banknotes, while the values for skewness and variance are higher for fake banknotes.
  3. Among the features, variance has the highest correlation with the class variable.
  4. The Random Forest algorithm was run under six scenarios in which the ntree and mtry parameters were varied, with the goal of obtaining the optimum parameters for building the forest. From the experiment, the optimum parameters are ntree = 20 and mtry = 3, with precision 99.5%, recall 99.5%, accuracy 99.6% and F1 score 99.5%.
  5. As future work, try parameters ntree < 20 and mtry < 3.
  6. The results of the Support Vector Machine are precision 100%, recall 100%, accuracy 100% and F1 score 100%.
  7. Based on points (4) and (6), the best algorithm for the banknote authentication dataset is the Support Vector Machine.