DATA SET DESCRIPTION

The dataset used in the book is the “Breast Cancer Wisconsin (Diagnostic) Data Set” from the UCI Machine Learning Repository, as described in Chapter 3 (“Lazy Learning - Classification Using Nearest Neighbors”) of the aforementioned book. The data set contains the results of routine breast cancer screening, which allows the disease to be diagnosed and treated before it causes noticeable symptoms. The goal is to practice classification analysis: predicting which sub-population a new observation belongs to on the basis of chosen metrics. In other words, after analyzing the cancer diagnosis dataset, we will be able to predict whether a patient has a benign or a malignant tumor.

Attributes: As I observed, the data can be divided into three parts: means (columns 3-12), standard errors (columns 13-22), and worst values (columns 23-32). Each part contains the same 10 parameters: radius, texture, area, perimeter, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
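The three-part column layout can be sketched in a few lines of R. This is only an illustration: the names `features`, `groups`, and `col_names`, as well as the generated column names, are assumptions and are not taken from the data file. The positions follow the column indices used by the histogram code (3-12, 13-22, 23-32), and it is assumed that the first two columns hold an identifier and the diagnosis.

```r
# Ten measurements, in the order listed in the text (assumed naming).
features <- c("radius", "texture", "area", "perimeter", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dimension")

# Column positions of the three groups, ten columns each.
groups <- list(mean = 3:12, se = 13:22, worst = 23:32)

# Hypothetical column names such as "radius_mean", "radius_se", "radius_worst".
col_names <- unlist(lapply(names(groups),
                           function(g) paste(features, g, sep = "_")))

# Every group should cover exactly ten columns, 30 feature columns in total.
stopifnot(all(lengths(groups) == 10), length(col_names) == 30)

# col_names starts at column 3, so shift the positions by 2 to index it.
print(col_names[groups$se - 2])  # the standard error block
```

Keeping the positions in a named list means each group can be selected by name instead of by repeating the raw indices.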

Load the data

I randomized the imported data.
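The loading code itself is not shown here. As a minimal sketch, assuming that "randomized" means shuffling the row order, the step might look like the following. The small data frame is a hypothetical stand-in with made-up values, and the file name in the comment is an assumption as well.

```r
# Tiny stand-in for the imported data (hypothetical values). In practice this
# would come from something like read.csv("wdbc.csv"), a file name assumed here.
dataset <- data.frame(
  id = 1:6,
  diagnosis = factor(c("B", "M", "B", "B", "M", "B")),
  radius_mean = c(12.1, 20.3, 11.4, 13.0, 18.7, 10.9)
)

set.seed(123)                                  # makes the shuffle reproducible
shuffled <- dataset[sample(nrow(dataset)), ]   # permute the row order
rownames(shuffled) <- NULL                     # drop the old row labels

# The shuffle reorders rows but keeps every record intact.
stopifnot(nrow(shuffled) == nrow(dataset),
          setequal(shuffled$id, dataset$id))
```

Shuffling before a train/test split guards against any ordering present in the source file, and fixing the seed keeps the result repeatable between runs.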

Exploration of data

# Bar chart of the diagnosis classes, with the count printed above each bar
ggplot(data = dataset, aes(x = diagnosis)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

p1<-frqtab(dataset$diagnosis)
pander(p1, style="rmarkdown", caption="Original diagnosis frequencies (%)")

Original diagnosis frequencies (%)
Benign	Malignant
62.7	37.3

There are 357 benign cases and 212 malignant cases (569 in total).
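The helper `frqtab` is called above but its definition is not shown. One plausible definition, given here as an assumption rather than the original code, is a percentage frequency table rounded to one decimal place. The `diagnosis` vector below is rebuilt from the counts stated in the text, so the sketch runs without the data file.

```r
# Assumed definition: percentage frequency table, rounded to one decimal.
frqtab <- function(x) {
  round(100 * prop.table(table(x)), 1)
}

# Rebuild the class counts reported in the text: 357 benign, 212 malignant.
diagnosis <- factor(rep(c("Benign", "Malignant"), times = c(357, 212)))

p1 <- frqtab(diagnosis)
print(p1)
```

With these counts the helper gives 62.7 for Benign and 37.3 for Malignant, which matches the frequency table shown above.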

# Histograms of the ten "mean" features (columns 3-12), one plot per column
plots <- lapply(names(dataset)[3:12], function(col) {
  ggplot(data = dataset, aes(x = .data[[col]])) +
    geom_histogram(bins = 50) +
    xlab(col)  # label each panel with its column name
})
grid.arrange(grobs = plots, nrow = 4, widths = c(1, 1, 1))

# Histograms of the ten "standard error" features (columns 13-22)
plots <- lapply(names(dataset)[13:22], function(col) {
  ggplot(data = dataset, aes(x = .data[[col]])) +
    geom_histogram(bins = 50) +
    xlab(col)  # label each panel with its column name
})
grid.arrange(grobs = plots, nrow = 4, widths = c(1, 1, 1))

# Histograms of the ten "worst" features (columns 23-32)
plots <- lapply(names(dataset)[23:32], function(col) {
  ggplot(data = dataset, aes(x = .data[[col]])) +
    geom_histogram(bins = 50) +
    xlab(col)  # label each panel with its column name
})
grid.arrange(grobs = plots, nrow = 4, widths = c(1, 1, 1))

Classification analysis
