This report describes banknote authentication prediction using classification models. The dataset used in this project is the Banknote Authentication data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication
Data were extracted from images taken of genuine and forged banknote-like specimens. An industrial camera usually used for print inspection was used for digitization. The final images have 400×400 pixels. Owing to the object lens and the distance to the investigated object, grayscale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.
The dataset contains a total of 1372 records of different banknotes. The four leftmost columns hold measurements we can use to predict whether a note is genuine or counterfeit; the label is external information provided by a human, coded as 0 for a real banknote and 1 for a fake one. Machine learning algorithms require the features and the label to be separated from each other, where the label is the output class or category. In our dataset, variance, skewness, curtosis, and entropy are the features, whereas the class column contains the label.
Report Outline:
1. Extract Data
2. Data Preparation (variable names, structure, summary, class factor, missing values, class proportion)
3. Exploratory Plots (distributions, boxplots, scatter plots, correlation)
4. Normalization and Outlier Removal
5. Train/Test Split
6. Modeling (Random Forest, Support Vector Machine)
7. Evaluation
Extract Data
Read the data from its URL into a data frame in R.
loc <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/"
ds <- "data_banknote_authentication.txt"
url <- paste(loc, ds, sep ="")
bn_df <- read.table(url, sep = ",")
Change Variable Names
Rename the variables to variance, skewness, curtosis, entropy, and class.
names(bn_df) <- c("variance", "skewness", "curtosis", "entropy", "class")
Structure of Data
Inspect the structure of the data frame. There are 1372 observations and 5 variables.
str(bn_df)
## 'data.frame': 1372 obs. of 5 variables:
## $ variance: num 3.622 4.546 3.866 3.457 0.329 ...
## $ skewness: num 8.67 8.17 -2.64 9.52 -4.46 ...
## $ curtosis: num -2.81 -2.46 1.92 -4.01 4.57 ...
## $ entropy : num -0.447 -1.462 0.106 -3.594 -0.989 ...
## $ class : int 0 0 0 0 0 0 0 0 0 0 ...
It is evident from the output that our dataset has four features: variance, skewness, curtosis, and entropy, while class records whether the banknote is real or fake.
The variables are:
1. Variance finds how each pixel varies from the neighboring pixels and classifies them into different regions
2. Skewness is the measure of the lack of symmetry
3. Curtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
4. Entropy is a quantity which is used to describe the amount of information which must be coded for, by a compression algorithm
5. Class contains two values: 0 representing a real note and 1 representing a fake note
Summary of Data
To see the statistical details of the data, the summary() function can be used; it returns the minimum, quartiles, mean, and maximum values for each column.
summary(bn_df)
## variance skewness curtosis entropy
## Min. :-7.0421 Min. :-13.773 Min. :-5.2861 Min. :-8.5482
## 1st Qu.:-1.7730 1st Qu.: -1.708 1st Qu.:-1.5750 1st Qu.:-2.4135
## Median : 0.4962 Median : 2.320 Median : 0.6166 Median :-0.5867
## Mean : 0.4337 Mean : 1.922 Mean : 1.3976 Mean :-1.1917
## 3rd Qu.: 2.8215 3rd Qu.: 6.815 3rd Qu.: 3.1793 3rd Qu.: 0.3948
## Max. : 6.8248 Max. : 12.952 Max. :17.9274 Max. : 2.4495
## class
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4446
## 3rd Qu.:1.0000
## Max. :1.0000
We can see that the variables have different means and spreads, so some normalization or standardization may be required prior to modeling.
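For comparison, z-score standardization is a common alternative to the min-max normalization applied later in this report; a minimal sketch (bn_scaled is an illustrative name, and this step is not part of the original pipeline):
# z-score standardization: center each feature and divide by its standard deviation
# (for comparison only; the report itself uses min-max normalization below)
bn_scaled <- as.data.frame(scale(bn_df[1:4]))
summary(bn_scaled)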
Change Value of Class
Convert the class variable to a factor, where 0 is real and 1 is fake.
bn_df$class <- factor(bn_df$class,
levels = c(0,1),
labels = c("real","fake"))
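A quick check confirms the new factor labels:
levels(bn_df$class)
## [1] "real" "fake"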
Check for Missing Value
The data contain no missing values:
colSums(is.na(bn_df))
## variance skewness curtosis entropy class
## 0 0 0 0 0
Check for Proportion of Target Variable
prop.table(table(bn_df$class))
##
## real fake
## 0.5553936 0.4446064
There are 762 real notes (55.5%) and 610 fake notes (44.5%).
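The underlying counts can be checked directly:
table(bn_df$class)
##
## real fake
##  762  610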
Distribution Plot
A histogram plot is then created for each variable.
# load the plotting libraries used throughout
library(ggplot2)
library(gridExtra)   # for grid.arrange()
#variance
h1 <- ggplot(bn_df, aes(x=variance)) +
geom_histogram(aes(y=..density..),
binwidth=.5,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")
#skewness
h2 <- ggplot(bn_df, aes(x=skewness)) +
geom_histogram(aes(y=..density..),
binwidth=.5,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")
#curtosis
h3 <- ggplot(bn_df, aes(x=curtosis)) +
geom_histogram(aes(y=..density..),
binwidth=.5,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")
#entropy
h4 <- ggplot(bn_df, aes(x=entropy)) +
geom_histogram(aes(y=..density..),
binwidth=.5,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") # Overlay with transparent density plot
grid.arrange(h1, h2 , h3, h4, ncol = 2)
We can see that the first two variables have a roughly Gaussian distribution, while the next two may follow a skewed Gaussian or an exponential distribution. There may be outliers in the entropy and curtosis variables, and the data are not normalized. To be sure, we examine boxplots.
Boxplot
p21 <- ggplot(bn_df, aes(x = class, y = variance)) +
geom_boxplot()
p22 <- ggplot(bn_df, aes(x = class, y = skewness)) +
geom_boxplot()
p23 <- ggplot(bn_df, aes(x = class, y = curtosis)) +
geom_boxplot()
p24 <- ggplot(bn_df, aes(x = class, y = entropy)) +
geom_boxplot()
grid.arrange(p21, p22 , p23, p24, ncol = 2)
The entropy and curtosis variables have outliers, which should be handled in the data preparation step. Examining the proportions of the target variable (class) also shows the classes are reasonably balanced.
p1 <- ggplot(bn_df, aes(x = variance, y = skewness, color = class)) +
geom_point()
p2 <- ggplot(bn_df, aes(x = variance, y = curtosis, color = class)) +
geom_point()
p3 <- ggplot(bn_df, aes(x = skewness, y = curtosis, color = class)) +
geom_point()
p4 <- ggplot(bn_df, aes(x = variance, y = entropy, color = class)) +
geom_point()
p5 <- ggplot(bn_df, aes(x = skewness, y = entropy, color = class)) +
geom_point()
p6 <- ggplot(bn_df, aes(x = curtosis, y = entropy, color = class)) +
geom_point()
grid.arrange(p1, p2 , p3, p4, p5, p6, ncol = 2)
It is visible from the output that entropy and variance have a slight linear correlation, and there is an inverse linear correlation between curtosis and skewness. Finally, the values of curtosis and entropy appear slightly higher for real banknotes, while the values of skewness and variance appear higher for fake banknotes.
Compute and visualize Pearson's correlation coefficients (r) between the features.
cor(bn_df[1:4])
## variance skewness curtosis entropy
## variance 1.0000000 0.2640255 -0.3808500 0.2768167
## skewness 0.2640255 1.0000000 -0.7868952 -0.5263208
## curtosis -0.3808500 -0.7868952 1.0000000 0.3188409
## entropy 0.2768167 -0.5263208 0.3188409 1.0000000
library(PerformanceAnalytics)   # for chart.Correlation()
chart.Correlation(bn_df[1:4], histogram=TRUE, pch=19)
From the correlation matrix and the chart, we infer that skewness and curtosis are strongly negatively correlated (r = -0.79), skewness and entropy are moderately negatively correlated (r = -0.53), and the remaining pairs are only weakly to moderately correlated.
Normalization
The value ranges are not extreme, but they differ across variables and include negative values, so we handle this with min-max normalization, after which each variable lies between 0 and 1.
# min-max normalization: rescale x to the [0, 1] range
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
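As a sanity check, applying normalize to a toy vector (values chosen purely for illustration) maps the minimum to 0 and the maximum to 1:
normalize(c(-2, 0, 2))
## [1] 0.0 0.5 1.0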
class <- bn_df$class
bn_df <- as.data.frame(lapply(bn_df[1:4], normalize))
bn_df <- cbind(bn_df, class)
summary(bn_df)
## variance skewness curtosis entropy class
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 real:762
## 1st Qu.:0.3800 1st Qu.:0.4515 1st Qu.:0.1599 1st Qu.:0.5578 fake:610
## Median :0.5436 Median :0.6022 Median :0.2543 Median :0.7239
## Mean :0.5391 Mean :0.5873 Mean :0.2879 Mean :0.6689
## 3rd Qu.:0.7113 3rd Qu.:0.7704 3rd Qu.:0.3647 3rd Qu.:0.8132
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Remove Outlier
Removing outliers from the curtosis and entropy variables.
# get outliers
out_curtosis <- boxplot.stats(bn_df$curtosis)$out
out_entropy <- boxplot.stats(bn_df$entropy)$out
# get row indices of the outlier values
idx_out_curtosis <- which(bn_df$curtosis %in% out_curtosis)
idx_out_entropy <- which(bn_df$entropy %in% out_entropy)
# both index vectors refer to the same data frame, so remove the rows in one
# step; removing them sequentially would shift the remaining row positions
bn_df <- bn_df[-unique(c(idx_out_curtosis, idx_out_entropy)), ]
p21 <- ggplot(bn_df, aes(x = class, y = variance)) +
geom_boxplot()
p22 <- ggplot(bn_df, aes(x = class, y = skewness)) +
geom_boxplot()
p23 <- ggplot(bn_df, aes(x = class, y = curtosis)) +
geom_boxplot()
p24 <- ggplot(bn_df, aes(x = class, y = entropy)) +
geom_boxplot()
grid.arrange(p21, p22 , p23, p24, ncol = 2)
The boxplots confirm that the outliers in the curtosis and entropy variables have been removed.
We divide the data into training and test sets. The training set is used to train the machine learning algorithms, while the test set is used to evaluate their performance. We used a test size of 0.4, so the test set contains 40% of the data and the remaining 60% is used to build the models.
m <- nrow(bn_df)
set.seed(2301)
# sample 60% of the row indices for training
train_idx <- sample(m, floor(0.6 * m))
train_df <- bn_df[train_idx, ]
test_df <- bn_df[-train_idx, ]
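As a quick check (the exact figures depend on the seed), we can confirm that both sets retain roughly the original class balance:
prop.table(table(train_df$class))
prop.table(table(test_df$class))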
We trained two classification algorithms on the prepared data to predict banknote authentication: Random Forest and Support Vector Machine.
Random Forest
For this model we built six scenarios, varying the number of trees (ntree) and the number of variables sampled when building each tree (mtry). The goal of the scenarios is to determine the best Random Forest parameters by comparing the accuracy of each one.
## Random Forest
library(randomForest)
set.seed(2310)
fit.forest204 <- randomForest(formula = class ~ .,
data = train_df,
na.action = na.roughfix,
importance = TRUE,
ntree = 20,
mtry = 4)
set.seed(2310)
fit.forest404 <- randomForest(formula = class ~ .,
data = train_df,
na.action = na.roughfix,
importance = TRUE,
ntree = 40,
mtry = 4)
set.seed(2310)
fit.forest604 <- randomForest(formula = class ~ .,
data = train_df,
na.action = na.roughfix,
importance = TRUE,
ntree = 60,
mtry = 4)
set.seed(2310)
fit.forest203 <- randomForest(formula = class ~ .,
data = train_df,
na.action = na.roughfix,
importance = TRUE,
ntree = 20,
mtry = 3)
set.seed(2310)
fit.forest403 <- randomForest(formula = class ~ .,
data = train_df,
na.action = na.roughfix,
importance = TRUE,
ntree = 40,
mtry = 3)
set.seed(2310)
fit.forest603 <- randomForest(formula = class ~ .,
data = train_df,
na.action = na.roughfix,
importance = TRUE,
ntree = 60,
mtry = 3)
Support Vector Machine
## Support Vector Machine
library(e1071)   # for svm()
fit.svm <- svm(formula = class ~ .,
               data = train_df)
To evaluate our models, we use accuracy, precision, recall, and F1 score.
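The original report does not show how these metrics are computed; below is a minimal sketch of a helper that derives them from a confusion matrix. The function name eval_metrics and the choice of "fake" as the positive class are assumptions, not part of the original code.
# hypothetical helper (not from the original report): compute precision,
# recall, accuracy, and F1 from actual vs. predicted class labels,
# treating "fake" as the positive class by default
eval_metrics <- function(actual, predicted, positive = "fake") {
  cm <- table(Actual = actual, Predicted = predicted)
  tp <- cm[positive, positive]
  fp <- sum(cm[, positive]) - tp
  fn <- sum(cm[positive, ]) - tp
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  accuracy  <- sum(diag(cm)) / sum(cm)
  f1        <- 2 * precision * recall / (precision + recall)
  round(c(Precision = precision, Recall = recall,
          Accuracy = accuracy, F1 = f1), 3)
}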
Random Forest
After building the models, we evaluate the six scenarios; the results are as follows.
| mtry | ntree | Precision | Recall | Accuracy | F1 Score |
|---|---|---|---|---|---|
| 3 | 20 | 0.995 | 0.995 | 0.996 | 0.995 |
| 3 | 40 | 0.991 | 0.991 | 0.992 | 0.991 |
| 3 | 60 | 0.995 | 0.991 | 0.994 | 0.993 |
| 4 | 20 | 0.977 | 0.986 | 0.984 | 0.982 |
| 4 | 40 | 0.986 | 0.986 | 0.988 | 0.986 |
| 4 | 60 | 0.986 | 0.986 | 0.988 | 0.986 |
From the table above, the choice of mtry has only a small effect on the results, though precision drops somewhat for mtry = 4 with ntree = 20. The best scenario achieves precision 99.5%, recall 99.5%, accuracy 99.6%, and F1 score 99.5%.
## Predicted
## Actual real fake
## real 296 1
## fake 1 215
## == Random Forest nTrees 20 mtry 3 ==
## Precision = 0.995
## Recall = 0.995
## Accuracy = 0.996
## F1 Score = 0.995
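The output above can be reproduced along these lines (a sketch; the original prediction code is not shown, and eval_metrics is the hypothetical helper defined earlier):
pred_rf <- predict(fit.forest203, newdata = test_df)
table(Actual = test_df$class, Predicted = pred_rf)
eval_metrics(test_df$class, pred_rf)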
Since the scenarios perform comparably, we choose mtry = 3 and ntree = 20 as the best parameters to minimize computation time.
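Because the forests were fit with importance = TRUE, the contribution of each feature can also be inspected (this step is not in the original report):
importance(fit.forest203)   # per-feature mean decrease in accuracy and Gini
varImpPlot(fit.forest203)   # dot plot of the same importance measures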
Support Vector Machine
In this SVM model, we use the radial kernel (the default for svm()) and obtain the following result.
## Predicted
## Actual real fake
## real 297 0
## fake 0 216
## == Support Vector Machine ==
## Precision = 1
## Recall = 1
## Accuracy = 1
## F1 Score = 1
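As with the Random Forest, the output above can be reproduced roughly as follows (a sketch using the hypothetical eval_metrics helper):
pred_svm <- predict(fit.svm, newdata = test_df)
table(Actual = test_df$class, Predicted = pred_svm)
eval_metrics(test_df$class, pred_svm)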
This algorithm performs very well: it classifies fake and genuine banknotes with 100% precision, recall, accuracy, and F1 score on the test set.