R Markdown

Description

This report describes the prediction of breast cancer diagnosis using machine learning algorithms. The dataset used in this report is the Breast Cancer Wisconsin dataset hosted on Kaggle.

The dataset can be downloaded [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Report Outline
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation

1. Data Extraction

The dataset is downloaded from Kaggle and saved in the data folder.

bcw_df <- read.csv("data/data.csv")

To see the number of rows and columns, we use the dim() function. The dataset has 569 rows and 33 columns.

dim(bcw_df)
## [1] 569  33

2. Exploratory Data Analysis

To find out the column names and types, we use the str() function.

str(bcw_df)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...

From the result above, we know the following:
1. The first column is id. It is a unique identifier and unnecessary for prediction, so it should be removed.
2. The second column is diagnosis. This is the class variable. Its current type is chr, and it should be converted to a factor.
3. The last column is X. All of its values are NA, so it should be removed.

# remove unnecessary columns
bcw_df$id <- NULL
bcw_df$X <- NULL

# convert the target variable to a factor
bcw_df$diagnosis <- as.factor(bcw_df$diagnosis)

2.1 Univariate Data Analysis

Analysis of a single variable. First, we count the number of benign (B) and malignant (M) cases in the dataset.

library(ggplot2)
ggplot(data=bcw_df, aes(x=diagnosis)) + geom_bar()
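
The exact class counts behind the bar chart can also be printed with table() (output omitted here):

# tabulate benign (B) vs. malignant (M) cases
table(bcw_df$diagnosis)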

Distribution of the *radius_mean* variable in a boxplot.

ggplot(data=bcw_df, aes(y=radius_mean)) +
  geom_boxplot() +
  labs(title = "Breast Cancer Wisconsin Data", y="Radius Mean")

Distribution of the *radius_mean* variable in a histogram.

ggplot(data = bcw_df, aes(x=radius_mean)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1 <- ggplot(data=bcw_df, aes(x=diagnosis)) + geom_bar()
p2 <- ggplot(data=bcw_df, aes(y=radius_mean)) +
  geom_boxplot() +
  labs(title = "Breast Cancer Wisconsin Data", y="Radius Mean")
p3 <- ggplot(data = bcw_df, aes(x=radius_mean)) + geom_histogram()

library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.2 Bivariate Data Analysis

Analysis of two variables. Distribution of the radius_mean variable based on diagnosis.

ggplot(data=bcw_df, aes(x=diagnosis, y=radius_mean)) +
  geom_boxplot() +
  geom_jitter(alpha=0.3,
              color =" blue",
              width = 0.2) +
  labs(title="Breast Cancer Wisconsin Data", x="Diagnosis", y="Radius Mean")

ggplot(data=bcw_df, aes(x=radius_mean, fill=diagnosis)) +
  geom_density(alpha=.3)

Observations based on the radius_mean and texture_mean variables. Each point is a single observation; its color and shape are based on the diagnosis (benign or malignant).

ggplot(data=bcw_df, aes(x=radius_mean, y=texture_mean,
                        shape=diagnosis, color=diagnosis)) +
  geom_point() +
  labs(title="Breast Cancer Wisconsin Data", x="Radius Mean", y="Texture Mean")

In general, benign cases have lower radius_mean and texture_mean measurements than malignant ones. However, these two variables are not enough to separate the classes.

2.3 Multivariate Data Analysis

There are three types of measurements: mean, standard error (se), and worst (the mean of the three largest values). Each measurement has 10 variables, so there are 30 variables in total. We want to compute and visualize the correlation coefficients within each measurement group.

Visualize Pearson's correlation coefficients for the *_mean variables.

# install.packages("corrgram")
library(corrgram)
corrgram(bcw_df[2:11], order = TRUE,
         upper.panel = panel.pie)

Visualize Pearson's correlation coefficients for the *_se variables.

library(corrgram)
corrgram(bcw_df[12:21], order = TRUE,
         upper.panel = panel.pie)

Visualize Pearson's correlation coefficients for the *_worst variables.

library(corrgram)
corrgram(bcw_df[22:31], order = TRUE,
         upper.panel = panel.pie)

From the correlation coefficients, we can see that area, radius, and perimeter are collinear, so two of them should be removed: area and perimeter.

We can also see that compactness, concavity, and concave.points are collinear.
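
These collinear groups can be confirmed numerically with cor() (output omitted here; the off-diagonal coefficients should be close to 1):

# pairwise Pearson correlations within each suspected collinear group
cor(bcw_df[, c("radius_mean", "perimeter_mean", "area_mean")])
cor(bcw_df[, c("compactness_mean", "concavity_mean", "concave.points_mean")])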

3. Data Preparation

3.1 Feature Selection

Remove the *_worst variables. Based on discussion with a domain expert, all variables ending in _worst should be removed.

bcw_df2 <- bcw_df[1:21]

Remove area, perimeter, compactness, and concavity (both their _mean and _se versions).

bcw_df2$area_mean <- NULL
bcw_df2$perimeter_mean <- NULL
bcw_df2$compactness_mean <- NULL
bcw_df2$concavity_mean <- NULL

bcw_df2$area_se <- NULL
bcw_df2$perimeter_se <- NULL
bcw_df2$compactness_se <- NULL
bcw_df2$concavity_se <- NULL

dim(bcw_df2)
## [1] 569  13

3.2 Remove Outliers
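
This section is left empty in the report. As a minimal sketch, outliers in a single variable could be flagged with the 1.5 × IQR rule (not applied here, so the downstream results are unaffected):

# flag radius_mean values outside 1.5 * IQR of the quartiles (sketch only; not applied)
q <- quantile(bcw_df2$radius_mean, c(0.25, 0.75))
iqr <- q[2] - q[1]
is_outlier <- bcw_df2$radius_mean < q[1] - 1.5 * iqr |
  bcw_df2$radius_mean > q[2] + 1.5 * iqr
sum(is_outlier)  # number of flagged rows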

3.3 Feature Scaling
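
Also left empty in the report. A minimal sketch using scale() on the numeric columns (not applied here; note that svm() in e1071 scales its inputs by default, and tree-based models do not require scaling):

# standardize the numeric features to zero mean and unit variance (sketch only; not applied)
bcw_scaled <- bcw_df2
bcw_scaled[, -1] <- scale(bcw_df2[, -1])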

3.4 PCA
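
Also left empty in the report. A minimal sketch using prcomp() (not applied here):

# principal component analysis on the standardized numeric features (sketch only; not applied)
pca <- prcomp(bcw_df2[, -1], center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component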

3.5 Training and Test Data

Use set.seed() for reproducible results. The train:test ratio is 70:30.

m <- nrow(bcw_df2)
set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_df <- bcw_df2[train_idx, ]
test_df <- bcw_df2[-train_idx, ]

4. Modeling

We use four machine learning algorithms.

4.1 Logistic Regression

fit.logit <- glm(diagnosis ~ .,
                 data = train_df,
                 family = binomial)

summary(fit.logit)
## 
## Call:
## glm(formula = diagnosis ~ ., family = binomial, data = train_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.29130  -0.06276  -0.01079   0.00266   2.41194  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -38.2941    10.8434  -3.532 0.000413 ***
## radius_mean               0.9770     0.3246   3.009 0.002618 ** 
## texture_mean              0.5801     0.1110   5.226 1.73e-07 ***
## smoothness_mean          -0.2969    51.6663  -0.006 0.995414    
## concave.points_mean     134.7234    38.6187   3.489 0.000486 ***
## symmetry_mean            42.5947    20.3845   2.090 0.036657 *  
## fractal_dimension_mean   22.7464   105.3806   0.216 0.829105    
## radius_se                 5.0455     2.9014   1.739 0.082037 .  
## texture_se               -1.5634     0.9470  -1.651 0.098737 .  
## smoothness_se            72.5956   159.1434   0.456 0.648272    
## concave.points_se      -160.9316   122.3363  -1.315 0.188347    
## symmetry_se             -27.4352    60.9517  -0.450 0.652628    
## fractal_dimension_se   -351.4736   288.8728  -1.217 0.223716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.229  on 397  degrees of freedom
## Residual deviance:  75.269  on 385  degrees of freedom
## AIC: 101.27
## 
## Number of Fisher Scoring iterations: 9

4.2 Decision Tree

library(party)
fit.ctree <- ctree(diagnosis~., data=train_df)
plot(fit.ctree, main="Conditional Inference Tree")

4.3 Random Forest

library(randomForest)
set.seed(2021)
fit.forest <- randomForest(diagnosis~., data=train_df, 
                           na.action=na.roughfix,
                           importance=TRUE)
fit.forest
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = train_df, importance = TRUE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 5.28%
## Confusion matrix:
##     B   M class.error
## B 249   9  0.03488372
## M  12 128  0.08571429

4.4 Support Vector Machine (SVM)

library(e1071)
set.seed(2021)
fit.svm <- svm(diagnosis~., data=train_df)
fit.svm
## 
## Call:
## svm(formula = diagnosis ~ ., data = train_df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  119

5. Evaluation

We compute sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, and accuracy.

performance <- function(table, n=2){
  # table: 2x2 confusion matrix (rows = actual, columns = predicted)
  # n: number of digits for rounding
  tn = table[1,1]
  fp = table[1,2]
  fn = table[2,1]
  tp = table[2,2]
  
  sensitivity = tp/(tp+fn) # recall
  specificity = tn/(tn+fp) 
  ppp = tp/(tp+fp) # precision
  npp = tn/(tn+fn)
  hitrate = (tp+tn)/(tp+tn+fp+fn) # accuracy
  
  result <- paste("Sensitivity = ", round(sensitivity, n) ,
  "\nSpecificity = ", round(specificity, n),
  "\nPositive Predictive Value = ", round(ppp, n),
  "\nNegative Predictive Value = ", round(npp, n),
  "\nAccuracy = ", round(hitrate, n), "\n", sep="")
  
  cat(result)
}
prob <- predict(fit.logit, test_df, type="response")
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
                     labels=c("benign", "malignant"))
logit.perf <- table(test_df$diagnosis, logit.pred,
                    dnn=c("Actual","Predicted"))
logit.perf
##       Predicted
## Actual benign malignant
##      B     95         4
##      M      7        65
performance(logit.perf)
## Sensitivity = 0.9
## Specificity = 0.96
## Positive Predictive Value = 0.94
## Negative Predictive Value = 0.93
## Accuracy = 0.94
ctree.pred <- predict(fit.ctree, test_df, type = "response")
ctree.perf <- table(test_df$diagnosis, ctree.pred,
                    dnn=c("Actual","Predicted"))
ctree.perf
##       Predicted
## Actual  B  M
##      B 92  7
##      M  7 65
forest.pred <- predict(fit.forest, test_df, type = "response")
forest.perf <- table(test_df$diagnosis, forest.pred,
                    dnn=c("Actual", "Predicted"))
forest.perf
##       Predicted
## Actual  B  M
##      B 97  2
##      M  7 65
svm.pred <- predict(fit.svm, test_df, type = "response")
svm.perf <- table(test_df$diagnosis, svm.pred,
                    dnn=c("Actual", "Predicted"))
svm.perf
##       Predicted
## Actual  B  M
##      B 96  3
##      M  8 64
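
The same performance() function can be applied to the remaining confusion matrices to compare all four models on identical metrics (output omitted here):

# compute the same metrics for the other models
performance(ctree.perf)
performance(forest.perf)
performance(svm.perf)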

6. Recommendations

  1. The random forest algorithm is the best among all the tested algorithms.
  2. Based on the decision tree model, the most important variables are concave.points, radius_mean, and texture_mean (see the importance sketch below).
  3. The result can be improved by better data preparation or by using other algorithms. However, the current result surpasses human-level performance (79% accuracy), so the model can be deployed as a second opinion for the doctor.
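
Recommendation 2 can be cross-checked against the random forest's variable importance; a minimal sketch using functions from the randomForest package (output omitted here):

# mean decrease in node impurity per variable, and the corresponding plot
importance(fit.forest, type = 2)
varImpPlot(fit.forest)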
