References:
Lantz, Brett. Machine Learning with R. Olton, GB: Packt Publishing, 2013. Chapter 3, p. 75. ProQuest ebrary. Web. 24 April 2017. Copyright © 2013 Packt Publishing. All rights reserved. http://proxy.library.upenn.edu:2061/lib/upenn/reader.action?docID=10794279
https://www.kaggle.com/gargmanish/d/uciml/breast-cancer-wisconsin-data/basic-machine-learning-with-cancer/notebook
https://rpubs.com/jesuscastagnetto/caret-knn-cancer-prediction

DATA SET DESCRIPTION

The dataset used in the book is the “Breast Cancer Wisconsin (Diagnostic) Data Set” from the UCI Machine Learning Repository, as described in Chapter 3 (“Lazy Learning - Classification Using Nearest Neighbors”) of the aforementioned book. The data set contains measurements computed from digitized images of fine-needle aspirates of breast masses. Routine breast cancer screening allows the disease to be diagnosed and treated before it causes noticeable symptoms. The goal here is to practice classification analysis: predicting which sub-population a new observation belongs to on the basis of chosen metrics. In other words, after analyzing the cancer diagnosis dataset, we will be able to predict whether a patient's tumor is benign or malignant.

Attributes: As I observed, the data can be divided into three parts: mean (columns 3-12), standard error (columns 13-22), and worst (columns 23-32). Each part contains the same 10 parameters: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
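The three-part layout of the attributes can be written down as column index ranges. The sketch below is only illustrative: it assumes column 1 is a patient ID and column 2 is the diagnosis, which is consistent with the histogram code later in this report indexing columns 3 through 32. The names `base_features`, `feature_groups`, and `feature_names` are hypothetical, and the generated column names may not match the names in the actual file.

```r
# Ten base measurements (order assumed to follow the original data file).
base_features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave_points",
                   "symmetry", "fractal_dimension")

# Column positions of each group, assuming column 1 = ID and column 2 = diagnosis.
feature_groups <- list(mean = 3:12, se = 13:22, worst = 23:32)

# Hypothetical column names: one per (measurement, group) pair, e.g. "radius_mean".
feature_names <- lapply(names(feature_groups),
                        function(g) paste(base_features, g, sep = "_"))
names(feature_names) <- names(feature_groups)

# Number of columns in each group.
lengths(feature_groups)
```

With these ranges, an expression such as `dataset[, feature_groups$mean]` would select the ten mean columns, provided the data frame really has that layout.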

Load the data

Note: I randomized the row order of the imported data.
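The loading code itself is not shown in this section. As a rough sketch of what randomizing the imported rows could look like, the example below shuffles a small toy data frame with base R. The helper name `shuffle_rows`, the seed value, and the file name in the commented `read.csv` call are all assumptions, not the code actually used for this report.

```r
# Return the rows of a data frame in a random but reproducible order.
shuffle_rows <- function(df, seed = 1) {
  set.seed(seed)                         # fixed seed so the permutation is repeatable
  df[sample(nrow(df)), , drop = FALSE]   # sample(n) yields a random permutation of 1..n
}

# Toy stand-in for the imported data. The real import might look like:
# dataset <- read.csv("data.csv", stringsAsFactors = FALSE)  # file name is an assumption
toy <- data.frame(id = 1:6,
                  diagnosis = c("B", "B", "B", "M", "M", "B"))

shuffled <- shuffle_rows(toy, seed = 42)
```

Shuffling before any train/test split guards against an ordering in the source file (for example, records sorted by diagnosis) leaking into the split.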

Exploration of data

# Bar chart of the diagnosis classes, with the count printed above each bar
ggplot(data = dataset, aes(x = diagnosis)) + geom_bar() + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

p1<-frqtab(dataset$diagnosis)
pander(p1, style="rmarkdown", caption="Original diagnosis frequencies (%)")
Original diagnosis frequencies (%)
Benign Malignant
62.7 37.3

There are 357 benign cases and 212 malignant cases.
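The `frqtab` helper called above is not defined in this section. A minimal stand-in, assuming it simply returns class percentages rounded to one decimal place, could look like the following. The label vector is rebuilt from the stated counts purely for illustration.

```r
# Hypothetical stand-in for frqtab: class percentages rounded to one decimal place.
frqtab <- function(x) {
  round(100 * prop.table(table(x)), 1)
}

# Rebuild the class labels from the counts reported above (357 benign, 212 malignant).
diagnosis <- factor(rep(c("Benign", "Malignant"), times = c(357, 212)))

p1 <- frqtab(diagnosis)
p1
```

With 569 cases in total, this reproduces the 62.7 % benign and 37.3 % malignant split shown in the table.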

# Histograms of the ten "mean" features (columns 3 to 12), one panel per column
plots <- lapply(names(dataset)[3:12], function(col) {
  ggplot(data = dataset, aes(x = .data[[col]])) + geom_histogram(bins = 50) + xlab(col)
})
# Arrange the ten panels on a 4 x 3 grid with equal column widths
grid.arrange(grobs = plots, nrow = 4, widths = c(1, 1, 1))

# Histograms of the ten "standard error" features (columns 13 to 22), one panel per column
plots <- lapply(names(dataset)[13:22], function(col) {
  ggplot(data = dataset, aes(x = .data[[col]])) + geom_histogram(bins = 50) + xlab(col)
})
# Arrange the ten panels on a 4 x 3 grid with equal column widths
grid.arrange(grobs = plots, nrow = 4, widths = c(1, 1, 1))

# Histograms of the ten "worst" features (columns 23 to 32), one panel per column
plots <- lapply(names(dataset)[23:32], function(col) {
  ggplot(data = dataset, aes(x = .data[[col]])) + geom_histogram(bins = 50) + xlab(col)
})
# Arrange the ten panels on a 4 x 3 grid with equal column widths
grid.arrange(grobs = plots, nrow = 4, widths = c(1, 1, 1))