library(alr3)
## Loading required package: car
## Loading required package: carData
data("banknote") # load the banknote data set from alr3
banknote$Y <- factor(banknote$Y) # treat the class label as a factor
str(banknote)
## 'data.frame': 200 obs. of 7 variables:
## $ Length : num 215 215 215 215 215 ...
## $ Left : num 131 130 130 130 130 ...
## $ Right : num 131 130 130 130 130 ...
## $ Bottom : num 9 8.1 8.7 7.5 10.4 9 7.9 7.2 8.2 9.2 ...
## $ Top : num 9.7 9.5 9.6 10.4 7.7 10.1 9.6 10.7 11 10 ...
## $ Diagonal: num 141 142 142 142 142 ...
## $ Y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
The data are stored in a data frame with six continuous numeric variables and one factor variable, Y, whose two levels ("0" and "1") indicate whether a banknote is genuine or counterfeit.
library(ggplot2)
ggplot(data = banknote, aes(Y, fill=Y))+geom_bar(width = 0.4)
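The class balance that the bar chart displays can also be checked numerically. The following is a minimal sketch, assuming the `banknote` data from `alr3` are available as loaded above; the name `class_counts` is introduced here only for illustration.

```r
library(alr3)                      # provides the banknote data used above
data("banknote")
banknote$Y <- factor(banknote$Y)

# Count how many notes fall into each level of Y
class_counts <- table(banknote$Y)
print(class_counts)
```

The two counts should match the heights of the two bars in the chart.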
First, let’s draw side-by-side histograms of all six variables to see which one best distinguishes genuine from counterfeit banknotes.
library(gridExtra)
p1=ggplot(data = banknote, aes(Length, fill=Y))+geom_histogram(binwidth = 0.25)
p2=ggplot(data = banknote, aes(Left, fill=Y))+geom_histogram(binwidth = 0.25)
p3=ggplot(data = banknote, aes(Right, fill=Y))+geom_histogram(binwidth = 0.25)
p4=ggplot(data = banknote, aes(Bottom, fill=Y))+geom_histogram(binwidth = 0.25)
p5=ggplot(data = banknote, aes(Top, fill=Y))+geom_histogram(binwidth = 0.25)
p6=ggplot(data = banknote, aes(Diagonal, fill=Y))+geom_histogram(binwidth = 0.25)
grid.arrange(p1,p2, p3, p4, p5, p6)
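The six near-identical histogram calls can also be written as a single faceted plot. This is only a sketch of an alternative, not the approach used above: it assumes the same `banknote` data, reshapes it with base R's `stack()`, and uses the illustrative names `measures`, `banknote_long` and `p_all`.

```r
library(alr3)
library(ggplot2)
data("banknote")
banknote$Y <- factor(banknote$Y)

# Stack the six measurement columns into long format:
# `values` holds the measurement, `ind` the name of the variable
measures <- c("Length", "Left", "Right", "Bottom", "Top", "Diagonal")
banknote_long <- cbind(stack(banknote[, measures]),
                       Y = rep(banknote$Y, times = length(measures)))

# One histogram panel per variable, each with its own axis range
p_all <- ggplot(banknote_long, aes(values, fill = Y)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ ind, scales = "free")
print(p_all)
```

Free scales are used because the variables live on different ranges, which a shared x axis would squash into narrow bands.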
Diagonal shows the clearest distinction between genuine and counterfeit banknotes, so we focus on it and draw a separate histogram.
ggplot(data = banknote, aes(Diagonal, fill=Y))+geom_histogram(binwidth = 0.22)
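The visual impression from the Diagonal histogram can be backed up with a per-class numeric summary. The sketch below assumes the same `banknote` data; `diag_by_class` and `diag_summary` are illustrative names.

```r
library(alr3)
data("banknote")
banknote$Y <- factor(banknote$Y)

# Split Diagonal by class and compute min, mean and max for each level of Y
diag_by_class <- split(banknote$Diagonal, banknote$Y)
diag_summary <- sapply(diag_by_class,
                       function(x) c(min = min(x), mean = mean(x), max = max(x)))
print(round(diag_summary, 2))
```

Comparing the minimum of one class with the maximum of the other shows whether Diagonal alone separates the classes completely or whether their ranges still overlap slightly.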
The following scatter plots show pairs of measurements that allow us to separate genuine from counterfeit banknotes.
p1=ggplot(data = banknote, aes(x= Right,y = Diagonal, colour=Y))+geom_point()
p2=ggplot(data = banknote, aes(x= Bottom,y = Diagonal, colour=Y))+geom_point()
p3=ggplot(data = banknote, aes(x= Length,y = Diagonal, colour=Y))+geom_point()
p4=ggplot(data = banknote, aes(x= Left,y = Diagonal, colour=Y))+geom_point()
p5=ggplot(data = banknote, aes(x= Bottom,y = Top, colour=Y))+geom_point()
p6=ggplot(data = banknote, aes(x= Top,y = Diagonal, colour=Y))+geom_point()
grid.arrange(p1,p2, p3, p4, p5, p6, nrow=3, ncol=2)
However, in the scatter plot of Top vs. Bottom the points of the two classes overlap somewhat. In the other five plots, Diagonal is the key variable for a clear separation between the two types of banknotes: in each of them, the genuine banknotes have the larger diagonal whatever the value of the other variable.
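To check that no informative pair was missed among the six chosen above, all pairs of measurements can be viewed at once with base R's `pairs()`. This is a sketch under the same assumptions about the `banknote` data; the colours and the names `measures` and `class_cols` are chosen here for illustration only.

```r
library(alr3)
data("banknote")
banknote$Y <- factor(banknote$Y)

measures <- c("Length", "Left", "Right", "Bottom", "Top", "Diagonal")

# One colour per note: first level of Y in red, second level in blue
class_cols <- c("red", "blue")[as.integer(banknote$Y)]

# Scatter plot matrix of every pair of measurements, coloured by class
pairs(banknote[, measures], col = class_cols, pch = 19, cex = 0.5)
```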
- In part 4, the plot is a bar chart: Y is mapped to both x and the fill colour, so the bars show the number of genuine and counterfeit banknotes. The plot shows that the two classes are the same size (100 notes each).
- In part 5, a histogram shows the frequency distribution of Diagonal, with genuine and counterfeit notes distinguished by fill colour. Diagonal is mapped to x. Red represents the genuine notes and blue the counterfeit ones. The main message of the plot is the sharp distinction between genuine and counterfeit notes: genuine banknotes tend to have a larger diagonal measurement than counterfeit ones.
- The plots in part 6 are all scatter plots. In five of them, Diagonal is mapped to y and one of the other variables (Right, Bottom, Length, Left or Top) is mapped to x; the remaining plot maps Bottom to x and Top to y. In every plot, Y is mapped to colour. The main message of these plots is the sharp difference between genuine and counterfeit notes; in addition, the genuine notes have the larger diagonal. The relationship between each variable and Diagonal cannot be determined from the pooled data, because the factor Y plays an important role here: its two levels produce two clearly separated clusters of points (one for the genuine notes and one for the counterfeit ones).