# these are the required packages and some initial manipulation of the data
library(ggplot2)
library(e1071)
library(stringr)
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
cancer = read.csv("breast_cancer.csv")
cancer$id = NULL
cancer$X = NULL
cancer$diagnosis = ifelse(cancer$diagnosis=='M', 'malignant', 'benign') # recode the M/B codes as readable labels

This analysis is of a dataset containing information on over 500 cases of breast cancer. Each instance is classified as either benign or malignant and has various characteristics that can be used in assessing the threat of the cancerous region. Two machine learning techniques were used to model the breast cancer dataset: Support Vector Machines and Decision Trees.

Model Creation

“80-20 split”

This next chunk of code splits the breast cancer dataset into a training set and a testing set. This allows us to build our model with the training set and measure its accuracy on the held-out testing set. This “80-20 split” (approximate here, since each row is assigned to the training set with probability 0.8) is a traditional way of building and testing models.

set.seed(40) # setting the seed for reproducible results
cancer[,'train'] <- ifelse(runif(nrow(cancer))<0.8,1,0) # flag roughly 80% of rows for training
#separate training and test sets
trainset <- cancer[cancer$train==1,]
testset <- cancer[cancer$train==0,]
#get column index of train flag
trainColNum <- grep('train',names(trainset))
#remove train flag column from train and test sets
trainset <- trainset[,-trainColNum]
testset <- testset[,-trainColNum]

#get column index of predicted variable in dataset
typeColNum <- grep('diag',names(cancer))

Support Vector Machine model

The first machine learning technique we will use is a Support Vector Machine, built with the aid of the “e1071” package. The SVM finds a hyperplane (or hyperplanes) that separates the classes of a dataset. In this particular model, the hyperplane will be built from all of the available columns; because it lives in a high-dimensional space, we have no good way of visualizing it. The kernel can be linear or radial, depending on how cleanly the classes separate. Because this dataset has many variables, a binary benign/malignant classification, and a well-defined separation between the classes, this machine learning technique is ideal for our data.

svm_model <- svm(diagnosis~., data=trainset, type='C-classification', kernel='radial') # create svm model

# we set the kernel to radial as this data set does not have a linear plane that cleanly separates the classes

pred_train <- predict(svm_model, trainset) # predict on the training set with the new SVM model
mean(pred_train==trainset$diagnosis)  # proportion of the training set predicted correctly by the svm
## [1] 0.989083
pred_test <- predict(svm_model, testset) # predict on the held-out test set
mean(pred_test==testset$diagnosis)    # proportion of the test set predicted correctly by the svm
## [1] 0.981982
table(pred_test, testset$diagnosis)  # confusion matrix of the svm predictions and the test labels
##            
## pred_test   benign malignant
##   benign        72         2
##   malignant      0        37
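The radial-kernel choice can also be checked empirically; a minimal sketch (not part of the original analysis) that fits the same model with a linear kernel for comparison:

# Sketch: refit with a linear kernel; if the radial kernel is truly needed,
# the radial model's test accuracy above should be noticeably higher.
svm_linear <- svm(diagnosis~., data=trainset, type='C-classification', kernel='linear')
mean(predict(svm_linear, testset)==testset$diagnosis)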

Decision Tree model

Now we will create a decision tree model and train it with our training subset of the data. The decision tree from the “rpart” package is a powerful, yet simple, tool that is best suited to classification problems. However, because of the complexity of this dataset, we expect this model to fall short of the more sophisticated SVM.

# complexity parameter (cp) set to 0.0005; a lower cp allows the tree to grow more splits
tree = rpart(diagnosis~., data=trainset, control = rpart.control(cp = .0005)) 

tree_pred = predict(tree, testset, type='class')
mean(tree_pred==testset$diagnosis) # proportion of the test set predicted correctly
## [1] 0.9459459
table(tree_pred, testset$diagnosis) # confusion matrix of the test labels and the model predictions
##            
## tree_pred   benign malignant
##   benign        67         1
##   malignant      5        38
tree_pred_full = predict(tree, cancer, type='class')
mean(tree_pred_full==cancer$diagnosis) # proportion of the full data set predicted correctly
## [1] 0.9578207
table(tree_pred_full, cancer$diagnosis) # confusion matrix of the full data set and the model predictions
##               
## tree_pred_full benign malignant
##      benign       340         7
##      malignant     17       205
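To see how the cp = 0.0005 setting interacts with tree size, one could inspect rpart's cross-validated complexity table; a brief sketch:

# Sketch: printcp reports the cross-validated error at each pruning level,
# useful for checking whether such a small cp overgrows the tree.
printcp(tree)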

Decision Tree Visualized

The decision tree model is easily visualized. Here you can see how the model makes decisions and ultimately predicts whether a cancerous region is benign or malignant. This visual in particular was created using the rattle package.
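The exact plotting call is not shown in the document; a minimal sketch, assuming the visual came from rattle's fancyRpartPlot:

# Sketch of how the tree visual was likely produced; the exact call is not
# shown in the original document.
fancyRpartPlot(tree, main = "Decision tree for the breast cancer dataset")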

Plot of concave.points_worst vs. radius_worst

Now that we have visualized our decision tree model, we can see that concave.points_worst and radius_worst are two of the most significant variables in determining whether a cancerous region is benign or malignant. We can plot these two characteristics to visualize what separates the benign regions from the malignant ones.
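The plotting code is likewise not shown; a minimal sketch with ggplot2 (loaded at the top), assuming the column names referenced in the heading above:

# Sketch: scatter the two features the tree splits on most heavily, colored
# by diagnosis, to show how well they separate the two classes on their own.
ggplot(cancer, aes(x = radius_worst, y = concave.points_worst, color = diagnosis)) +
  geom_point(alpha = 0.6) +
  labs(title = 'concave.points_worst vs. radius_worst')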

As you can see, there is a relatively well-defined separation between the benign and malignant cases; however, there is still a good bit of overlap. A model based purely on this two-dimensional space would be ineffective and would misclassify many data points. With the help of the SVM, which works in the full high-dimensional space, we can create a very accurate model for predictions.
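One way to quantify that claim (a sketch, not part of the original analysis) is to refit the SVM on only these two columns and compare its test accuracy against the roughly 0.98 achieved by the full model:

# Sketch: an SVM restricted to the two plotted features; the gap between its
# test accuracy and the full model's shows what the other dimensions add.
svm_2d <- svm(diagnosis ~ radius_worst + concave.points_worst, data = trainset,
              type = 'C-classification', kernel = 'radial')
mean(predict(svm_2d, testset) == testset$diagnosis)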