---
title: "HW01 - k-nn algorithm - Diagnosing Breast Cancer"
author: "Hoa Quach"
output: html_notebook
---

# Diagnosing Breast Cancer
## Using the k-NN Algorithm

Routine breast cancer screening allows the disease to be diagnosed and treated before
it causes noticeable symptoms. The process of early detection involves examining
the breast tissue for abnormal lumps or masses. If a lump is found, a fine-needle
aspiration biopsy is performed, which uses a hollow needle to extract a small sample
of cells from the mass. A clinician then examines the cells under a microscope to
determine whether the mass is likely to be malignant or benign.

If machine learning could automate the identification of cancerous cells, it would
provide considerable benefit to the health system. Automated processes are likely
to improve the efficiency of the detection process, allowing physicians to spend less
time diagnosing and more time treating the disease. An automated screening system
might also provide greater detection accuracy by removing the inherently subjective
human component from the process.

We will investigate the utility of machine learning for detecting cancer by applying
the k-NN algorithm to measurements of biopsied cells from women with abnormal
breast masses.

## Data Collection

We will utilize the Wisconsin Breast Cancer Diagnostic dataset from the UCI
Machine Learning Repository at http://archive.ics.uci.edu/ml. This data was
donated by researchers at the University of Wisconsin and includes measurements
from digitized images of a fine-needle aspirate of a breast mass. The values
represent the characteristics of the cell nuclei present in the digital image.

The breast cancer data includes 569 examples of cancer biopsies, each with
32 features. One feature is an identification number, another is the cancer diagnosis,
and 30 are numeric-valued laboratory measurements. The diagnosis is coded as
"M" to indicate malignant or "B" to indicate benign.

Based on the feature names, all the features seem to relate to the shape and size of
the cell nuclei. Unless you are an oncologist, you are unlikely to know how each
relates to benign or malignant masses. These patterns will be revealed as we continue
through the machine learning process.

## Exploring and Preparing the Data

Let's explore the data and see whether we can shine some light on the relationships.
In doing so, we will prepare the data for use with the k-NN learning method.

We'll begin by importing the CSV data file, as we have done in previous chapters,
saving the Wisconsin breast cancer data to the wbcd data frame:

```{r eval=TRUE, echo=FALSE}
setwd("C:\\Users\\KEVIN\\Downloads\\_CSU East Bay\\R\\Machine-Learning-with-R-datasets-master")
```


```{r}
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
```

Using the str(wbcd) command, we can confirm that the data is structured with
569 examples and 32 features as we expected. The first several lines of output
are as follows:

```{r}
str(wbcd)
```

The first variable is an integer variable named id. As this is simply a unique
identifier (ID) for each patient in the data, it does not provide useful information,
and we will need to exclude it from the model.

Let's drop the id feature altogether. As it is located in the first column, we can
exclude it by making a copy of the wbcd data frame without column 1:

```{r}
wbcd <- wbcd[-1]
```

The next variable, diagnosis, is of particular interest as it is the outcome we
hope to predict. This feature indicates whether the example is from a benign
or malignant mass. The table() output indicates that 357 masses are benign
while 212 are malignant:


```{r}
table(wbcd$diagnosis)
```

Many R machine learning classifiers require that the target feature is coded as a
factor, so we will need to recode the diagnosis variable. We will also take this
opportunity to give the "B" and "M" values more informative labels using the
labels parameter:

```{r}
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
```

Now, when we look at the prop.table() output, we see that 62.7 percent of the
masses are labeled Benign and 37.3 percent are labeled Malignant:

```{r}
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
```


The remaining 30 features are all numeric, and as expected, they consist of three
different measurements of ten characteristics. For illustrative purposes, we will
only take a closer look at three of these features:

```{r}
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
```

Looking at the features side-by-side, do you notice anything problematic about the
values? Recall that the distance calculation for k-NN is heavily dependent upon
the measurement scale of the input features. Since smoothness ranges from 0.05 to
0.16 and area ranges from 143.5 to 2501.0, the impact of area is going to be much
larger than that of smoothness in the distance calculation. This could potentially
cause problems for our classifier, so let's apply normalization to rescale the
features to a standard range of values.
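
To make the scale problem concrete, here is a quick illustration (not part of the
main analysis): the Euclidean distance between the first two biopsies, computed on
just these three raw features, is driven almost entirely by area_mean because of its
much larger range.

```{r}
# Illustration only: Euclidean distance between the first two biopsies using the
# three raw features above. The area_mean difference dominates the result.
dist(wbcd[1:2, c("radius_mean", "area_mean", "smoothness_mean")])
```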

## Transformation - Normalizing Numeric Data

To normalize these features, we need to create a normalize() function in R. This
function takes a vector x of numeric values, and for each value in x, subtracts the
minimum value in x and divides by the range of values in x. Finally, the resulting
vector is returned. The code for this function is as follows:

```{r}
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
```

After executing the preceding code, the normalize() function is available for use in
R. Let's test the function on a couple of vectors:

```{r}
normalize(c(1, 2, 3, 4, 5))
```

```{r}
normalize(c(10, 20, 30, 40, 50))
```

The function appears to be working correctly. Despite the fact that the values in the
second vector are 10 times larger than those in the first vector, after normalization,
they both appear exactly the same.

We can now apply the normalize() function to the numeric features in our data
frame. Rather than normalizing each of the 30 numeric variables individually, we
will use one of R's functions to automate the process.

The lapply() function takes a list and applies a specified function to each list
element. As a data frame is a list of equal-length vectors, we can use lapply() to
apply normalize() to each feature in the data frame. The final step is to convert the
list returned by lapply() to a data frame, using the as.data.frame() function. The
full process looks like this:

```{r}
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
```

In plain English, this command applies the normalize() function to columns
2 through 31 in the wbcd data frame, converts the resulting list to a data frame,
and assigns it the name wbcd_n. The _n suffix is used here as a reminder that the
values in wbcd have been normalized.

To confirm that the transformation was applied correctly, let's look at one variable's
summary statistics:

```{r}
summary(wbcd_n$area_mean)
```

As expected, the area_mean variable, which originally ranged from 143.5 to 2501.0,
now ranges from 0 to 1.

## Data Preparation - Creating Training and Test Datasets

Although all 569 biopsies are labeled with a benign or malignant status, it is not
very interesting to predict what we already know. Additionally, any performance
measures we obtain during training may be misleading, as we do not know the
extent to which cases have been overfitted or how well the learner will generalize
to unseen cases. A more interesting question is how well our learner performs on
unlabeled data. If we had access to a laboratory, we could apply our learner to
the measurements taken from the next 100 masses of unknown cancer status and
see how well the machine learner's predictions compare to the diagnoses obtained
using conventional methods.

In the absence of such data, we can simulate this scenario by dividing our data into
two portions: a training dataset that will be used to build the k-NN model and a test
dataset that will be used to estimate the predictive accuracy of the model. We will
use the first 469 records for the training dataset and the remaining 100 to simulate
new patients.

Using the data extraction methods given in Chapter 2, Managing and Understanding
Data, we will split the wbcd_n data frame into wbcd_train and wbcd_test:


```{r}
wbcd_train <- wbcd_n[1:469, ]
```

```{r}
wbcd_test <- wbcd_n[470:569, ]
```


If the preceding commands are confusing, remember that data is extracted from data
frames using the [row, column] syntax. A blank value for the row or column value
indicates that all the rows or columns should be included. Hence, the first line of
code takes rows 1 to 469 and all columns, and the second line takes 100 rows from
470 to 569 and all columns.
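
As a quick sanity check (purely illustrative), we can confirm the dimensions of the
two pieces:

```{r}
# 469 training rows and 100 test rows, each with the 30 normalized features
dim(wbcd_train)
dim(wbcd_test)
```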

When we constructed our normalized training and test datasets, we excluded the
target variable, diagnosis. For training the k-NN model, we will need to store
these class labels in factor vectors, split between the training and test datasets:

```{r}
wbcd_train_labels <- wbcd[1:469, 1]
```


```{r}
wbcd_test_labels <- wbcd[470:569, 1]

```


This code takes the diagnosis factor in the first column of the wbcd data frame, and
creates the vectors wbcd_train_labels and wbcd_test_labels. We will use these
in the next steps of training and evaluating our classifier.

## Training a Model on the Data

Equipped with our training data and labels vector, we are now ready to classify our
unknown records. For the k-NN algorithm, the training phase actually involves no
model building; the process of training a lazy learner like k-NN simply involves
storing the input data in a structured format.

To classify our test instances, we will use a k-NN implementation from the class
package, which provides a set of basic R functions for classification. If this package
is not already installed on your system, you can install it by typing:

```{r eval=FALSE}
# Run once if the class package is not already installed
install.packages("class")
```


To load the package during any session in which you wish to use the functions,
simply enter the library(class) command.

```{r}
library(class)
```


The knn() function in the class package provides a standard, classic
implementation of the k-NN algorithm. For each instance in the test data, the
function will identify the k nearest neighbors, using Euclidean distance, where k is
a user-specified number. The test instance is classified by taking a "vote" among the
k nearest neighbors; specifically, it is assigned the class held by the majority of
the k neighbors. A tie vote is broken at random.
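
To see the voting behavior in isolation, here is a small toy example that is separate
from the breast cancer data; the values and labels are made up purely for
illustration.

```{r}
# Toy illustration: with k = 3, the new point at x = 0.5 is assigned the class
# held by the majority of its three nearest training points.
toy_train  <- data.frame(x = c(0.1, 0.3, 0.4, 0.8, 0.95))
toy_labels <- factor(c("A", "A", "A", "B", "B"))
knn(train = toy_train, test = data.frame(x = 0.5), cl = toy_labels, k = 3)
```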

We now have nearly everything that we need to apply the k-NN algorithm to
this data. We've split our data into training and test datasets, each with exactly the
same numeric features. The labels for the training data are stored in a separate factor
vector. The only remaining parameter is k, which specifies the number of neighbors
to include in the vote.

As our training data includes 469 instances, we might try k = 21, an odd number
roughly equal to the square root of 469. With a two-category outcome, using an odd
number eliminates the chance of ending with a tie vote.
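
For reference, the square root of the training set size:

```{r}
# Roughly 21.7, so k = 21 is a convenient odd number near sqrt(469)
sqrt(469)
```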

Now we can use the knn() function to classify the test data:

```{r}
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
```


The knn() function returns a factor vector of predicted labels for each of the
examples in the test dataset, which we have assigned to wbcd_test_pred.
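
Before building a full cross tabulation, we can take a quick (purely illustrative)
look at how the predictions are distributed:

```{r}
# Distribution of predicted labels across the 100 test cases
table(wbcd_test_pred)
```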

## Evaluating Model Performance

The next step of the process is to evaluate how well the predicted classes in the
wbcd_test_pred vector match up with the known values in the wbcd_test_labels vector.
To do this, we can use the CrossTable() function in the gmodels package, which
was introduced in Chapter 2, Managing and Understanding Data. If you haven't done
so already, please install this package using the install.packages("gmodels")
command.

After loading the package with the library(gmodels) command, we can
create a cross tabulation indicating the agreement between the two vectors.
Specifying prop.chisq = FALSE will remove the unnecessary chi-square
values from the output:

```{r}
library(gmodels)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
```

The cell percentages in the table indicate the proportion of values that fall into four
categories. The top-left cell indicates the true negative results. These 61 of 100 values
are cases where the mass was benign and the k-NN algorithm correctly identified it
as such. The bottom-right cell indicates the true positive results, where the classifier
and the clinically determined label agree that the mass is malignant. A total of 37 of
100 predictions were true positives.

The cells falling on the other diagonal contain counts of examples where the k-NN
approach disagreed with the true label. The two examples in the lower-left cell are
false negative results; in this case, the predicted value was benign, but the tumor
was actually malignant. Errors in this direction could be extremely costly as they
might lead a patient to believe that she is cancer-free, but in reality, the disease may
continue to spread. The top-right cell would contain the false positive results, if
there were any. These values occur when the model classifies a mass as malignant,
but in reality, it was benign. Although such errors are less dangerous than a false
negative result, they should also be avoided as they could lead to additional financial
burden on the health care system or additional stress for the patient as additional
tests or treatment may have to be provided.
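
The same agreement can also be summarized numerically. The following sketch, which
uses only base R and is illustrative rather than part of the original analysis,
tabulates actual versus predicted labels and computes the proportion classified
correctly:

```{r}
# Illustrative summary of the agreement between actual and predicted labels
agreement <- table(actual = wbcd_test_labels, predicted = wbcd_test_pred)
agreement
sum(diag(agreement)) / sum(agreement)  # overall accuracy on the 100 test cases
```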


A total of 2 out of 100 masses, or 2 percent, were incorrectly classified by the k-NN
approach. While 98 percent accuracy seems impressive for a few lines of R code,
we might try another iteration of the model to see whether we can improve the
performance and reduce the number of values that have been incorrectly classified,
particularly because the errors were dangerous false negatives.

## Improving Model Performance

We will attempt two simple variations on our previous classifier. First, we will
employ an alternative method for rescaling our numeric features. Second, we
will try several different values for k.


## Transformation - Z-score Standardization

Although normalization is traditionally used for k-NN classification, it may
not always be the most appropriate way to rescale features. Since the z-score
standardized values have no predefined minimum and maximum, extreme values
are not compressed towards the center. One might suspect that with a malignant
tumor, we might see some very extreme outliers as the tumors grow uncontrollably.
It might, therefore, be reasonable to allow the outliers to be weighted more heavily in
the distance calculation. Let's see whether z-score standardization can improve our
predictive accuracy.

To standardize a vector, we can use R's built-in scale() function, which, by
default, rescales values using z-score standardization. The scale() function
offers the additional benefit that it can be applied directly to a data frame, so we can
avoid the use of the lapply() function. To create a z-score standardized version of
the wbcd data, we can use the following command:

```{r}
wbcd_z <- as.data.frame(scale(wbcd[-1]))
```

This command rescales all the features, with the exception of diagnosis, and stores
the result as the wbcd_z data frame. The _z suffix is a reminder that the values were
z-score transformed.
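
As an illustrative check of what scale() is doing, the standardized area_mean should
match the manual calculation (x - mean(x)) / sd(x):

```{r}
# scale() subtracts each column's mean and divides by its standard deviation
manual_z <- (wbcd$area_mean - mean(wbcd$area_mean)) / sd(wbcd$area_mean)
all.equal(manual_z, wbcd_z$area_mean)
```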

To confirm that the transformation was applied correctly, we can look at the
summary statistics:

```{r}
summary(wbcd_z$area_mean)
```

The mean of a z-score standardized variable should always be zero, and the range
should be fairly compact. A z-score greater than 3 or less than -3 indicates an
extremely rare value. With this in mind, the transformation seems to have worked.
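
To see how often such extreme values actually occur in this feature (an illustrative
check):

```{r}
# Count of standardized area_mean values more than 3 standard deviations from the mean
sum(abs(wbcd_z$area_mean) > 3)
```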

As we had done earlier, we need to divide the data into training and test sets, and
then classify the test instances using the knn() function. We'll then compare the
predicted labels to the actual labels using CrossTable():


```{r}
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
```

Unfortunately, as the following table shows, the results of our new transformation
are slightly less accurate. Where we had correctly classified 98 percent of examples
previously, we classified only 95 percent correctly this time. Making matters worse,
we did no better at classifying the dangerous false negatives:


```{r}
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
```


## Testing Alternative Values of k

We may be able to do even better by examining performance across various k values.
Using the normalized training and test datasets, the same 100 records can be
classified with several different k values, and the number of false negatives and
false positives compared for each iteration.
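
The following sketch shows one way this comparison could be run; the candidate k
values are illustrative, and the normalized data is re-split so that the z-score
split from the previous section is not reused:

```{r}
# Illustrative sweep over several k values using the normalized (0-1) data.
# For each k, count false negatives (malignant predicted as benign) and
# false positives (benign predicted as malignant) on the same 100 test cases.
wbcd_train_n <- wbcd_n[1:469, ]
wbcd_test_n  <- wbcd_n[470:569, ]
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train_n, test = wbcd_test_n,
              cl = wbcd_train_labels, k = k)
  fn <- sum(pred == "Benign" & wbcd_test_labels == "Malignant")
  fp <- sum(pred == "Malignant" & wbcd_test_labels == "Benign")
  cat("k =", k, "- false negatives:", fn, "- false positives:", fp, "\n")
}
```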

Although the classifier was never perfect, the 1-NN approach was able to avoid
some of the false negatives at the expense of adding false positives. It is important to
keep in mind, however, that it would be unwise to tailor our approach too closely to
our test data; after all, a different set of 100 patient records is likely to be somewhat
different from those used to measure our performance.

