1.1 Introduction Breast cancer is a malignant cell growth in the breast. If left untreated, the cancer spreads to other areas of the body. Excluding skin cancer, breast cancer is the most common type of cancer in women in the United States, accounting for one of every three cancer diagnoses. Breast cancer ranks second among cancer deaths in women. This project aims at analyzing data on women residing in the state of Wisconsin, USA to predict whether a case of breast cancer is malignant or benign.
1.2 Goal statement The goal of this project is the application of several data mining and machine learning techniques to classify whether a tumor mass is benign or malignant in women residing in the state of Wisconsin, USA. This will help in understanding the underlying importance of the attributes, thereby helping to predict the diagnosis of breast cancer from the values of these attributes. Through an understanding of the role these attributes play in cancer prediction, the healthcare community can perform additional research on them to help prevent the spread of breast cancer in the population of the USA.
1.3 Assumption and scope
* The project assumes that the dataset collected is representative of the entire women population of Wisconsin, USA.
* The data has been collected accurately.
* No errors have been committed while entering the collected data.
The scope of the project is confined only to predicting whether a case of breast cancer is malignant or benign for the women of Wisconsin. The project will not draw any conclusions whatsoever about the remaining women population of the USA. Nor will it go into the reasons why some attributes matter more than others in predicting breast cancer cases, as this would require considerable domain expertise in biomedical sciences.
2.1 Proposed data cleaning and preparation This stage includes identifying whether the dataset contains any missing values or bad data. The diagnosis (Y variable) will then be converted into an appropriate format (currently it is coded as M and B for malignant and benign respectively).
2.2 Proposed exploratory data analysis The EDA process will help in understanding the nature of the dataset and in identifying potential outliers or correlated variables.
2.3 Proposed modelling overview For modelling, k-Nearest Neighbor, Random Forest and Support Vector Machine algorithms will be used to classify the diagnosis Y-variable in the data as malignant or benign. The data will be partitioned into training and test sets, consisting of 80% and 20% of the original data respectively. The models will be built on the training dataset, and the test dataset will be used for evaluating model performance.
The data can be found at: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Features in the dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The separating plane used in the source study was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming,” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes. The dataset contains information on 569 women across 32 different attributes:
1) ID number
2) Diagnosis (M = malignant, B = benign)
The remaining 30 variables are divided into three groups, Mean (columns 3-12), Standard Error (columns 13-22) and Worst (columns 23-32), and each group contains the same 10 parameters (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension). Mean is the mean of the parameter over all cell nuclei in the image, Standard Error is its standard error, and Worst is the largest (“worst”) value, computed as the mean of the three largest values.
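Once the data is loaded in the next section, this column grouping can be verified directly (a quick sketch using the feature-name suffixes):
which(grepl("_mean$", names(breast.data)))  # columns 3-12
which(grepl("_se$", names(breast.data)))    # columns 13-22
which(grepl("_worst$", names(breast.data))) # columns 23-32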
# Install any listed package that is not already present (the listing covers
# every library loaded below)
package_listing <- c('data.table', 'tidyverse', 'DT', 'leaflet', 'plotly', 'ggthemes',
                     'corrplot', 'caret', 'class', 'randomForest', 'pROC',
                     'verification', 'kernlab', 'e1071', 'RColorBrewer')
for (required_package in package_listing) {
if (!require(required_package, character.only = T, quietly = T)) {
install.packages(required_package, repos = "http://cran.us.r-project.org")
library(required_package, character.only = T)
}
}
#Loading the packages
library(data.table)
library(tidyverse)
library(DT)
library(plotly)
library(corrplot)
library(caret)
library(class)
library(ggthemes) # ggplot themes
library(randomForest)
library(pROC)
library("verification")
library(kernlab) # SVM methodology
library(e1071) # SVM methodology
library(RColorBrewer) # customized coloring of plots for svm
setwd("C:/Users/ameya/Desktop/MS BANA/Capstone Project")
breast.data <- read.csv("Breast Cancer Wisconsin data.csv")
str(breast.data)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
fwrite(x = as.data.frame(summary(breast.data)), file = "foo.csv") # write the data summary out to a csv
breast.data[,33] <- NULL # Last column (X) contained only NA values; removing it
sum(is.na(breast.data)) # No NA values
## [1] 0
sum(duplicated(breast.data)) # 0 duplicate rows
## [1] 0
prop.table(table(breast.data$diagnosis)) # Class imbalance: 62.7% benign, 37.3% malignant
##
## B M
## 0.6274165 0.3725835
3.3.1 Data summary The table below lists the summary of the entire dataset. This gives us an understanding of how the data is structured across columns and helps us understand the nature of each variable. Out of 569 observations, we have 357 benign and 212 malignant tumors.
3.3.2 Breakup percentage of benign and malignant cancer cases The dataset contains a class imbalance between the malignant and benign cases. Of the 569 cases in the dataset, 62.7% are benign whereas 37.3% are malignant.
plot_ly(data = breast.data, x = ~diagnosis, color = ~diagnosis, type = "histogram") %>% layout(title = 'Breakup of benign and malignant cases')
3.3.3 Correlation plot An important step in exploratory data analysis is to identify whether there is any correlation between the variables. The plot below was created using Pearson’s correlation. Positive correlation between two variables is shown by the darkness of the blue color: the darker the blue box, the stronger the positive correlation between the respective variables. Similarly, negative correlation is shown by the darkness of the orange color: the darker the orange box, the stronger the negative correlation.
nc=ncol(breast.data)
df <- breast.data[, 2:nc] # diagnosis plus all 30 numeric features
df$diagnosis <- as.integer(factor(df$diagnosis))-1
correlations <- cor(df,method="pearson")
corrplot(correlations, number.cex = .9, method = "number",
order = "FPC",
type = "upper", tl.cex=0.8,tl.col = "black")
The next step in the analysis is to visualize some of these correlations as scatterplots in order to better understand the relationships between the variables mentioned above. As can be seen from the figures below, the correlation within each pair of variables is very high.
cor.test(breast.data$radius_worst, breast.data$perimeter_mean)
plot_ly(data = breast.data, x = ~radius_worst, y = ~perimeter_mean, color = ~diagnosis) %>% layout(title = 'Perimeter mean v. Radius worst')
cor.test(breast.data$radius_worst, breast.data$area_worst)
plot_ly(data = breast.data, x = ~radius_worst, y = ~area_worst, color = ~diagnosis) %>% layout(title = 'Area worst v. Radius worst')
cor.test(breast.data$perimeter_worst, breast.data$radius_worst)
plot_ly(data = breast.data, x = ~perimeter_worst, y = ~radius_worst, color = ~diagnosis) %>% layout(title = 'Perimeter worst v. Radius worst')
cor.test(breast.data$texture_worst, breast.data$texture_mean)
plot_ly(data = breast.data, x = ~texture_worst, y = ~texture_mean, color = ~diagnosis) %>% layout(title = 'Texture worst v. Texture mean')
3.3.4 Box plot Finally, to better understand the distribution of the data within the benign and malignant classes, it is essential to plot boxplots of all the columns separated by class, as shown below. We can see that almost all the attributes contain outliers when broken down into malignant and benign classes.
# Box and density plots of each feature, split by diagnosis
newNames = c(
"fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst",
"symmetry_mean", "symmetry_se", "symmetry_worst",
"concave.points_mean", "concave.points_se", "concave.points_worst",
"concavity_mean","concavity_se", "concavity_worst",
"compactness_mean", "compactness_se", "compactness_worst",
"smoothness_mean", "smoothness_se", "smoothness_worst",
"area_mean", "area_se", "area_worst",
"perimeter_mean", "perimeter_se", "perimeter_worst",
"texture_mean" , "texture_se", "texture_worst",
"radius_mean", "radius_se", "radius_worst"
)
bc.data = (breast.data[,newNames])
bc.diag = breast.data[,2]
scales <- list(x=list(relation="free"),y=list(relation="free"), cex=0.6)
featurePlot(x=bc.data, y=bc.diag, plot="box",scales=scales,
layout = c(6,5), auto.key = list(columns = 2))
featurePlot(x=bc.data, y=bc.diag, plot="density",scales=scales,
layout = c(6,5), auto.key = list(columns = 2),dark.theme = T)
In this phase, several supervised machine learning techniques are used to classify the response variable. Before performing any modelling, we sample 80% of the original data as the training set and use the remaining 20% as the test set. k-Nearest Neighbor, Random Forest and Support Vector Machine models are then used to classify the response variable.
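The chunks that follow use train and test objects whose creation is not shown in the report; a minimal sketch of the 80/20 split, assuming caret's createDataPartition, a hypothetical seed, and a 0/1 recoding of the diagnosis (matching the 0/1 class labels in the confusion matrices below), would be:
set.seed(123) # hypothetical seed; the original split code is not shown
# Drop the id column and recode diagnosis as a 0/1 factor (B = 0, M = 1)
model.data <- breast.data[-1]
model.data$diagnosis <- factor(ifelse(model.data$diagnosis == "M", 1, 0))
# Stratified 80/20 split: 456 training rows and 113 test rows, matching the outputs below
in.train <- createDataPartition(model.data$diagnosis, p = 0.8, list = FALSE)
train <- model.data[in.train, ]
test <- model.data[-in.train, ]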
kNN essentially performs classification by finding the most similar data points in the training data and making an educated guess based on their classifications. K is the number of nearest neighbors that the classifier uses to make its prediction: kNN predicts based on the outcome of the K neighbors closest to the point in question. One of the most popular choices for measuring this distance is the Euclidean distance.
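As a toy illustration (not part of the analysis), the Euclidean distance underlying knn() can be computed directly for two feature vectors:
euclid <- function(a, b) sqrt(sum((a - b)^2)) # straight-line distance in feature space
euclid(c(0, 3), c(4, 0)) # returns 5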
To determine the value of k, we apply the kNN algorithm to the training dataset for values of k from 1 to 30 and choose the optimal k based on the misclassification rate. The figure below shows the number of misclassifications for each value of k: the count fluctuates slightly for small k, drops sharply to its minimum, and then rises and stabilizes as k grows. The lowest number of misclassifications occurs for k = 11, 12 and 13.
# Run kNN for k = 1..30 and record the number of misclassifications on the test set
mis <- NULL
for(i in 1:30){
out_labels <- knn(train=train[,-1], test=test[,-1], cl=train$diagnosis, k=i, prob=T)
mis[i] <- sum(test[,1] != out_labels) # column 1 of test holds the true diagnosis
}
# Vector of candidate k values (number of neighbors)
num_k <- c(1:30)
df <- as.data.frame(cbind(mis, num_k))
ggplot(df, aes(x = num_k, y = mis)) +
geom_line(size = 2, color = "light grey") +
xlab("Number of neighbors (k)") +
ylab("Misclassifications") +
ggtitle("Misclassifications vs Number of Neighbors") +
theme_hc(bgcolor = "darkunica") + scale_colour_hc("darkunica")
K = 11, 12 and 13 give the lowest number of misclassifications, so we performed kNN predictions for these three values of k. As the misclassification rate is the same for all three, they produce the same confusion matrix, shown below along with the model statistics. The accuracy of the model is 0.9381.
out_labels_11 <- knn(train[,-1], test[,-1], train$diagnosis, k = 11, prob = TRUE)
cm_knn_11 <- confusionMatrix(out_labels_11, test$diagnosis)
cm_knn_11
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 6
## 1 1 36
##
## Accuracy : 0.9381
## 95% CI : (0.8765, 0.9747)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.718e-14
##
## Kappa : 0.8641
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.9859
## Specificity : 0.8571
## Pos Pred Value : 0.9211
## Neg Pred Value : 0.9730
## Prevalence : 0.6283
## Detection Rate : 0.6195
## Detection Prevalence : 0.6726
## Balanced Accuracy : 0.9215
##
## 'Positive' Class : 0
##
Random Forests, also known as random decision forests, are a popular ensemble method that can be used to build predictive models for both classification and regression problems. Ensemble methods use multiple learning models to gain better predictive results; in the case of a Random Forest, the model creates an entire forest of random, uncorrelated decision trees to arrive at the best possible answer.
The Random Forest starts with a standard machine learning technique called a “decision tree”, which in ensemble terms corresponds to our weak learner. In a decision tree, an input is entered at the top and, as it traverses down the tree, the data gets bucketed into smaller and smaller sets. The Random Forest takes this notion to the next level by combining the trees into an ensemble: the individual trees are weak learners and the Random Forest is a strong learner, as the numeric sketch below illustrates.
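A quick numeric illustration of why many independent weak learners make a strong learner (a sketch, assuming 501 independent trees that are each correct with probability 0.6):
p <- 0.6; B <- 501
# Probability that the majority vote of B independent trees is correct
sum(dbinom(ceiling(B/2):B, size = B, prob = p)) # ~0.999997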
For the entire breast cancer dataset, we build a Random Forest model with up to 500 decision trees. The figure below shows the performance of the Random Forest: the error for the two classes (colored) and for the out-of-bag samples (black) as a function of the number of trees. The class curves follow the factor-level order of the data (B, M), so red corresponds to the benign class and green to the malignant class.
set.seed(124)
rf <- randomForest(diagnosis~., data = breast.data[-1],
ntree=500, proximity = T, importance=T)
plot(rf, main="Random Forest: Error rate vs. Number of trees")
print(rf)
##
## Call:
## randomForest(formula = diagnosis ~ ., data = breast.data[-1], ntree = 500, proximity = T, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 3.87%
## Confusion matrix:
## B M class.error
## B 348 9 0.02521008
## M 13 199 0.06132075
The OOB error rate for the original Random Forest model is 3.87%. Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, which use bootstrap aggregating (bagging) to sub-sample the training data. The OOB error is the mean prediction error on each training sample xᵢ, computed using only the trees that did not have xᵢ in their bootstrap sample.
Subsampling allows one to define an out-of-bag estimate of prediction performance by evaluating predictions on those observations that were not used in building a given base learner. Out-of-bag estimates avoid the need for an independent validation dataset, but they often underestimate the actual performance improvement and the optimal number of iterations.
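The OOB predictions of the model fitted above can be inspected directly (a quick sanity check using the votes randomForest stores):
# rf$predicted holds, for each observation, the majority vote of only those
# trees whose bootstrap sample did not contain that observation
mean(rf$predicted != breast.data$diagnosis) # reproduces the OOB error rate (~0.0387)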
mtry <- tuneRF(breast.data[-1:-2],breast.data$diagnosis, ntreeTry=500,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE)
## mtry = 5 OOB error = 3.69%
## Searching left ...
## mtry = 4 OOB error = 3.87%
## -0.04761905 0.01
## Searching right ...
## mtry = 7 OOB error = 3.51%
## 0.04761905 0.01
## mtry = 10 OOB error = 3.69%
## -0.05 0.01
Two parameters are important in the Random Forest algorithm:
* the number of trees used in the forest (ntree), and
* the number of random variables used in each tree (mtry).
To find the optimal value of mtry, we apply a similar procedure in which the Random Forest is rerun for several candidate values of mtry. The optimal number of predictors per split is the one at which the out-of-bag error rate stabilizes and reaches its minimum.
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
print(mtry)
## mtry OOBError
## 4.OOB 4 0.03866432
## 5.OOB 5 0.03690685
## 7.OOB 7 0.03514938
## 10.OOB 10 0.03690685
print(best.m)
## [1] 7
From the above output, we can see that mtry = 7 is the best value of mtry since it has the lowest OOB error. We plug this best value of mtry into the Random Forest model to get the error rate below.
Parameters in the tuneRF function:
* stepFactor specifies the factor by which mtry is inflated (or deflated) at each iteration.
* improve specifies the (relative) improvement in OOB error that must be achieved for the search to continue.
* trace specifies whether to print the progress of the search.
* plot specifies whether to plot the OOB error as a function of mtry.
tuned.rf.model<- randomForest(diagnosis~.,data=breast.data[-1], mtry=best.m,
importance=TRUE,ntree=500)
print(tuned.rf.model)
##
## Call:
## randomForest(formula = diagnosis ~ ., data = breast.data[-1], mtry = best.m, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 3.51%
## Confusion matrix:
## B M class.error
## B 349 8 0.02240896
## M 12 200 0.05660377
Two types of importance measure are shown below: the mean decrease in accuracy (MDA) and the mean decrease in Gini. The accuracy measure tests how much worse the model performs without each variable, so a large decrease in accuracy is expected for highly predictive variables.
Gini importance measures the average gain of purity from splits on a given variable. A useful variable tends to split mixed-label nodes into purer single-class nodes, whereas splitting on a permuted variable tends neither to increase nor decrease node purity; again, a high score means the variable is important. Gini importance is closely related to the local decision function that the Random Forest uses to select the best available split.
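Both measures can also be read numerically from the fitted model (a quick sketch; importance = TRUE was set when training above):
# Columns cover the per-class importances plus MeanDecreaseAccuracy and MeanDecreaseGini
head(round(importance(tuned.rf.model), 2))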
varImpPlot(tuned.rf.model,main="Important variables using Random Forest method")
An SVM is a supervised learning model trained by a learning algorithm. Working in a high-dimensional space and typically using all of the attributes, it separates the space with flat, linear partitions, dividing the two categories by a gap that should be as wide as possible. This partition is defined by a plane called a hyperplane.
An SVM chooses the hyperplane with the largest margin in that high-dimensional space to separate the given data into classes; the margin is the width of the gap between the closest data points of the two classes.
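As a toy illustration of the margin (a sketch on two features only, with toy.fit and w as hypothetical names; distances are in the scaled units svm() uses internally):
# Fit a linear SVM on two features and recover the hyperplane normal vector w
toy.fit <- svm(diagnosis ~ radius_worst + perimeter_mean, data = train,
kernel = "linear", type = "C-classification")
w <- t(toy.fit$coefs) %*% toy.fit$SV # normal vector of the separating hyperplane
2 / sqrt(sum(w^2)) # width of the margin, 2/||w||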
We apply the svm function to our training dataset and generate predictions for the test data. The confusion matrix of this model on the test data is shown below; the accuracy of the model is 94.69%.
svm.model <- svm(diagnosis~., data = train, kernel = "linear",
type = "C-classification",scale = FALSE)
pred.svm <- predict(svm.model, test[,-1])
cm_svm <- confusionMatrix(pred.svm, test$diagnosis)
cm_svm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 5
## 1 1 37
##
## Accuracy : 0.9469
## 95% CI : (0.888, 0.9803)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.866e-15
##
## Kappa : 0.8841
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.9859
## Specificity : 0.8810
## Pos Pred Value : 0.9333
## Neg Pred Value : 0.9737
## Prevalence : 0.6283
## Detection Rate : 0.6195
## Detection Prevalence : 0.6637
## Balanced Accuracy : 0.9334
##
## 'Positive' Class : 0
##
Since this dataset contains 30 predictor variables, it is impossible to plot all 30 dimensions: a 2D/3D decision boundary can only be drawn for 2D/3D data, so with 30 features we are left with statistical analysis rather than direct visual inspection. Hence, to visualize the SVM classifier, the model is plotted against pairs of individual predictors.
plot(svm.model, train, perimeter_mean ~radius_worst)
plot(svm.model, train, area_worst ~radius_worst)
Linear Model
To improve the performance of the SVM model, we can tune it by selecting the best epsilon and cost values. The standard way of doing this is a grid search: we train many models for different pairs of epsilon and cost and choose the best one. We use the tune method to train models with epsilon = 0, 0.1, 0.2, ..., 1 and cost = 2^2, 2^3, ..., 2^5.
tuneResult <- tune(svm, diagnosis ~ ., data = train,
ranges = list(epsilon = seq(0,1,0.1), cost = 2^(2:5)))
plot(tuneResult)
From the results, we can see that the best parameters from the first grid are epsilon = 0, gamma = 0.033 and cost = 4. We can further tune the model over epsilon = 0, 0.01, 0.02, ..., 0.1 and cost = 1, 1.1, ..., 2 to get the figure below. From these results, the best parameters are epsilon = 0, gamma = 0.033 and cost = 1.8.
tuneResult1 <- tune(svm, diagnosis ~ ., data = train,
ranges = list(epsilon = seq(0,0.1,0.01),
cost = seq(1,2, 0.1)))
plot(tuneResult1)
print(tuneResult1)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0 1.8
##
## - best performance: 0.02193237
summary(tuneResult1$best.model)
##
## Call:
## best.tune(method = svm, train.x = diagnosis ~ ., data = train,
## ranges = list(epsilon = seq(0, 0.1, 0.01), cost = seq(1,
## 2, 0.1)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.8
## gamma: 0.03333333
##
## Number of Support Vectors: 95
##
## ( 50 45 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
The confusion matrix for testing the performance of our tuned model on the training and testing datasets is shown below. On the training data we get an accuracy of 0.9868, and on the test data we get an accuracy of 0.9558.
predict.svm.linear.train <- predict(tuneResult1$best.model, newdata = train)
confusionMatrix(predict.svm.linear.train, train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 285 5
## 1 1 165
##
## Accuracy : 0.9868
## 95% CI : (0.9716, 0.9952)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9717
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.9965
## Specificity : 0.9706
## Pos Pred Value : 0.9828
## Neg Pred Value : 0.9940
## Prevalence : 0.6272
## Detection Rate : 0.6250
## Detection Prevalence : 0.6360
## Balanced Accuracy : 0.9835
##
## 'Positive' Class : 0
##
Radial Model
To get a radial boundary between the classes, we use kernel = "radial". Since we don't know which cost will produce the optimal classification boundary, we use the tune() command to try several values of cost as well as of gamma, a scaling parameter used to fit nonlinear boundaries. tune() incorporates 10-fold cross-validation and returns the cost and gamma values giving the least error. Using a cost range of c(1,2,3,4,5) and a gamma range of c(1,2,3,4), we perform model tuning as above. The best model has cost = 2 and gamma = 1. The confusion matrices for the tuned model on the training and testing datasets are shown below: on the training data we get an accuracy of 100%, but on the test data the accuracy is just 63.72%, a clear sign of overfitting.
tune.out.radial <- tune(svm, diagnosis~., data = train, kernel = "radial",
ranges = list(cost = c(1,2,3,4,5),
gamma = c(1,2,3,4)))
svm.radial <- tune.out.radial$best.model
summary(svm.radial)
##
## Call:
## best.tune(method = svm, train.x = diagnosis ~ ., data = train,
## ranges = list(cost = c(1, 2, 3, 4, 5), gamma = c(1, 2, 3,
## 4)), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 2
## gamma: 1
##
## Number of Support Vectors: 456
##
## ( 170 286 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
predict.svm.radial.train <- predict(svm.radial, newdata = train)
confusionMatrix(predict.svm.radial.train, train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 286 0
## 1 0 170
##
## Accuracy : 1
## 95% CI : (0.9919, 1)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6272
## Detection Rate : 0.6272
## Detection Prevalence : 0.6272
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
predict.svm.radial.test <- predict(svm.radial, newdata = test)
confusionMatrix(predict.svm.radial.test, test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 71 41
## 1 0 1
##
## Accuracy : 0.6372
## 95% CI : (0.5414, 0.7255)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 0.4646
##
## Kappa : 0.0297
## Mcnemar's Test P-Value : 4.185e-10
##
## Sensitivity : 1.00000
## Specificity : 0.02381
## Pos Pred Value : 0.63393
## Neg Pred Value : 1.00000
## Prevalence : 0.62832
## Detection Rate : 0.62832
## Detection Prevalence : 0.99115
## Balanced Accuracy : 0.51190
##
## 'Positive' Class : 0
##
This project applied several data mining and machine learning techniques to classify whether a tumor mass is benign or malignant in women residing in the state of Wisconsin, USA. The key findings are:
1. Exploratory data analysis revealed high correlations between the variable pairs perimeter_mean and radius_worst, area_worst and radius_worst, perimeter_worst and radius_worst, and texture_mean and texture_worst.
2. The optimal value of k for the k-Nearest Neighbor classifier is 11, giving an accuracy of 93.81% on the test data.
3. For the Random Forest model, the OOB error rate is 3.87%. After tuning the model on the number of random variables used in each tree (mtry), the error rate decreases slightly to 3.51%.
4. The top 5 predictor variables for classification, according to the Random Forest model's variable importance, are radius worst, concave points worst, area worst, perimeter worst and concave points mean.
5. The untuned linear Support Vector Machine model achieved a classification accuracy of 94.69% on the test data.
6. The radial tuned SVM model fits the training data perfectly, correctly predicting all tumor classes, but does poorly on the test data with an accuracy of just 63.72%.
7. The linear tuned SVM model gives good accuracy on both training and test data: 0.9868 on the training data and 0.9558 on the test data.