Description

This report describes the prediction of breast cancer diagnosis using machine learning algorithms. We investigated … The dataset used in this report is the Breast Cancer Wisconsin dataset hosted on Kaggle.

The dataset can be downloaded here.

Report Outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

The dataset is downloaded from Kaggle and saved in the data folder. We use the read.csv() function to read the dataset into the bcw_df data frame.

bcw_df<-read.csv("data/data.csv")

To see the number of rows and columns, we use the dim() function. The dataset has 569 rows and 33 columns.

dim(bcw_df)
## [1] 569  33

2. Exploratory Data Analysis

To find out the column names and types, we use the str() function.

str(bcw_df)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...

From the result above, we observe the following:
1. The first column is id. It is unique and unnecessary for prediction, so it should be removed.
2. The second column is diagnosis. This should be the class variable. Its current type is character, and it should be converted to a factor.
3. The last column is X. All of its values are NA, so it should be removed.

# remove unnecessary columns
bcw_df$id<-NULL
bcw_df$X<-NULL

# convert the target variable to a factor
bcw_df$diagnosis <- as.factor(bcw_df$diagnosis)
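
A quick sanity check that the conversion worked (a sketch; its output is not shown in the source):

levels(bcw_df$diagnosis)  # expected levels: "B" "M"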

2.1. Univariate Data Analysis

Analysis of a single variable, for example with a box plot, histogram, or pie chart. Here we examine the number of benign (B) and malignant (M) cases in the diagnosis column, along with the distribution of radius_mean.

#multiple graph
library(ggplot2)
p1 <- ggplot(data=bcw_df, aes(x=diagnosis)) + geom_bar()
p2 <- ggplot(data=bcw_df, aes(y=radius_mean)) + geom_boxplot()+
  labs(title="Breast Cancer Wisconsin Data", y="Radius Mean")
p3 <- ggplot(data=bcw_df, aes(x=radius_mean)) + geom_histogram()+
  labs(title="Breast Cancer Wisconsin Data", x="Radius Mean")

library(gridExtra)
grid.arrange(p1,p2,p3, ncol =3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
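
The stat_bin() message simply notes that geom_histogram() fell back to 30 bins; passing an explicit binwidth to geom_histogram() silences it. The exact class counts behind the bar chart can be printed with table(); a one-line sketch (its output is not part of the original report):

table(bcw_df$diagnosis)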

2.2. Bivariate Data Analysis

Analysis of two variables, for example with a pair plot, scatter plot, or box plot with points. Here we look at the distribution of the radius_mean variable for each diagnosis class, first as a box plot with jittered points and then as a density plot.

ggplot(bcw_df, aes(x=diagnosis,y=radius_mean)) + geom_boxplot()+
  geom_jitter(alpha=0.3, 
              color ='blue',
              width =0.2)+
  labs(title="Breast Cancer Wisconsin Data",y="Radius Mean")

ggplot(data=bcw_df, aes(x=radius_mean, fill=diagnosis))+
  geom_density(alpha=0.3)

Next we plot observations by the radius_mean and texture_mean variables. Each point is a single observation; the color and shape of each point indicate the diagnosis (benign or malignant).

ggplot(bcw_df, aes(x=radius_mean, y=texture_mean, shape=diagnosis, color=diagnosis)) +
  geom_point()

In general, benign cases have lower radius_mean and texture_mean measurements than malignant ones. However, these two variables alone are not enough to separate the classes.

2.3. Multivariate Data Analysis

Analysis of three or more variables, for example with correlation coefficients.

There are three types of measurements: mean, standard error (se), and worst (the mean of the three largest values). Each measurement covers 10 variables, for a total of 30. We want to compute and visualize the correlation coefficients within each measurement group.

Visualize Pearson’s correlation coefficients for the *_mean variables.

library(corrgram)
corrgram(bcw_df[2:11], order = TRUE,
         upper.panel = panel.pie)

Visualize Pearson’s correlation coefficients for the *_se variables.

corrgram(bcw_df[12:21], order = TRUE,
         upper.panel = panel.pie)

Visualize Pearson’s correlation coefficients for the *_worst variables.

corrgram(bcw_df[22:31], order = TRUE,
         upper.panel = panel.pie)

From the correlation coefficients, we can see that area, radius, and perimeter are collinear, so we keep only radius and remove the other two: area and perimeter.
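
This can be confirmed numerically; a minimal sketch (not part of the original report) that prints the pairwise Pearson correlations of the three *_mean variables:

cor(bcw_df[, c("radius_mean", "perimeter_mean", "area_mean")])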

We can also see that compactness, concavity, and concave points are collinear, so we keep concave.points and remove the other two: compactness and concavity.

3. Data Preparation

3.1. Feature Selection

Remove the *_worst variables. Based on discussion with a domain expert, all the variables ending in _worst should be removed.

# keep diagnosis plus the *_mean and *_se columns (drop the *_worst columns)
bcw_df2 <- bcw_df[1:21]

Remove the area, perimeter, compactness, and concavity variables (both the _mean and _se versions).

bcw_df2$area_mean<-NULL
bcw_df2$perimeter_mean<-NULL
bcw_df2$compactness_mean<-NULL
bcw_df2$concavity_mean<-NULL

bcw_df2$area_se<-NULL
bcw_df2$perimeter_se<-NULL
bcw_df2$compactness_se<-NULL
bcw_df2$concavity_se<-NULL
dim(bcw_df2)
## [1] 569  13

3.2. Remove Outlier

3.3. Feature Scaling

3.4 PCA

3.5 Training and Test Division

We call set.seed() for reproducible results and split the data with a train:test ratio of 70:30.
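
The split code itself is not shown in the source; a minimal sketch consistent with the later sections (which use train_df and test_df, with 398 training and 171 test rows) could look like this. The seed value is an assumption, borrowed from the seeds used in the modeling section.

# assumed 70:30 split; train_df/test_df match the names used below
set.seed(2021)
train_idx <- sample(nrow(bcw_df2), size = round(0.7 * nrow(bcw_df2)))
train_df <- bcw_df2[train_idx, ]
test_df  <- bcw_df2[-train_idx, ]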

4. Modeling

In this section we fit four classification models on the training set: logistic regression, a conditional inference tree, a random forest, and a support vector machine.

4.1 Logistic Regression

fit.logit <- glm(formula = diagnosis ~ .,
                 data = train_df,
                 family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.logit)
## 
## Call:
## glm(formula = diagnosis ~ ., family = binomial, data = train_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.29130  -0.06276  -0.01079   0.00266   2.41194  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -38.2941    10.8434  -3.532 0.000413 ***
## radius_mean               0.9770     0.3246   3.009 0.002618 ** 
## texture_mean              0.5801     0.1110   5.226 1.73e-07 ***
## smoothness_mean          -0.2969    51.6663  -0.006 0.995414    
## concave.points_mean     134.7234    38.6187   3.489 0.000486 ***
## symmetry_mean            42.5947    20.3845   2.090 0.036657 *  
## fractal_dimension_mean   22.7464   105.3806   0.216 0.829105    
## radius_se                 5.0455     2.9014   1.739 0.082037 .  
## texture_se               -1.5634     0.9470  -1.651 0.098737 .  
## smoothness_se            72.5956   159.1434   0.456 0.648272    
## concave.points_se      -160.9316   122.3363  -1.315 0.188347    
## symmetry_se             -27.4352    60.9517  -0.450 0.652628    
## fractal_dimension_se   -351.4736   288.8728  -1.217 0.223716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.229  on 397  degrees of freedom
## Residual deviance:  75.269  on 385  degrees of freedom
## AIC: 101.27
## 
## Number of Fisher Scoring iterations: 9
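
The warning that fitted probabilities of 0 or 1 occurred indicates (quasi-)complete separation: some combination of predictors separates the two classes almost perfectly, which can inflate coefficient estimates and their standard errors (visible in the very large estimates for several *_se terms). The model can still be used for prediction, but individual coefficients should be interpreted with caution.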

4.2 Decision Tree

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
fit.ctree <- ctree(diagnosis~., data=train_df)
plot(fit.ctree, main="Conditional Inference Tree")

4.3 Random Forest

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(2021)
fit.forest <- randomForest(diagnosis~., data=train_df,
                           na.action=na.roughfix,
                           importance=TRUE)
fit.forest
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = train_df, importance = TRUE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 5.28%
## Confusion matrix:
##     B   M class.error
## B 249   9  0.03488372
## M  12 128  0.08571429
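
Since the forest was grown with importance=TRUE, the variable importance scores can also be inspected; a short sketch (not part of the original output):

# plot mean decrease in accuracy and mean decrease in Gini per predictor
varImpPlot(fit.forest, main="Variable Importance")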

4.4 Support Vector Machine

library(e1071)
set.seed(2021)
fit.svm <- svm(diagnosis~., data=train_df)
fit.svm
## 
## Call:
## svm(formula = diagnosis ~ ., data = train_df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  119
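
The SVM uses the default radial kernel with cost = 1. e1071 also provides a grid-search helper for tuning such hyperparameters; a sketch of an optional step (the grid values are arbitrary examples, not from the original report):

# 10-fold cross-validated grid search over cost and gamma
tuned <- tune(svm, diagnosis ~ ., data = train_df,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(tuned)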

5. Evaluation

We compute accuracy, sensitivity (recall), specificity, and the positive and negative predictive values; the positive predictive value is the precision. An F1-score helper is sketched at the end of this section.

performance <- function(table, n=2){
  # assumes rows = actual (B, M) and columns = predicted, with malignant as positive
  tn = table[1,1]
  fp = table[1,2]
  fn = table[2,1]
  tp = table[2,2]
  
  sensitivity = tp/(tp+fn) # recall
  specificity = tn/(tn+fp) 
  ppp = tp/(tp+fp) # precision
  npp = tn/(tn+fn)
  hitrate = (tp+tn)/(tp+tn+fp+fn) # accuracy
  
  result <- paste("Sensitivity = ", round(sensitivity, n) ,
  "\nSpecificity = ", round(specificity, n),
  "\nPositive Predictive Value = ", round(ppp, n),
  "\nNegative Predictive Value = ", round(npp, n),
  "\nAccuracy = ", round(hitrate, n), "\n", sep="")
  
  cat(result)
}
prob <- predict(fit.logit,test_df, type="response")
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
                     labels=c("benign", "malignant"))
logit.perf <- table(test_df$diagnosis, logit.pred,
                    dnn=c("Actual","Predicted"))
logit.perf
##       Predicted
## Actual benign malignant
##      B     95         4
##      M      7        65
performance(logit.perf)
## Sensitivity = 0.9
## Specificity = 0.96
## Positive Predictive Value = 0.94
## Negative Predictive Value = 0.93
## Accuracy = 0.94
ctree.pred <- predict(fit.ctree, test_df, type = "response")
ctree.perf <- table(test_df$diagnosis, ctree.pred,
                    dnn=c("Actual","Predicted"))
ctree.perf
##       Predicted
## Actual  B  M
##      B 92  7
##      M  7 65
performance(ctree.perf)
## Sensitivity = 0.9
## Specificity = 0.93
## Positive Predictive Value = 0.9
## Negative Predictive Value = 0.93
## Accuracy = 0.92
forest.pred <- predict(fit.forest, test_df, type = "response")
forest.perf <- table(test_df$diagnosis, forest.pred,
                    dnn=c("Actual","Predicted"))
forest.perf
##       Predicted
## Actual  B  M
##      B 97  2
##      M  7 65
performance(forest.perf)
## Sensitivity = 0.9
## Specificity = 0.98
## Positive Predictive Value = 0.97
## Negative Predictive Value = 0.93
## Accuracy = 0.95
svm.pred <- predict(fit.svm, test_df, type = "response")
svm.perf <- table(test_df$diagnosis, svm.pred,
                    dnn=c("Actual","Predicted"))
svm.perf
##       Predicted
## Actual  B  M
##      B 96  3
##      M  8 64
performance(svm.perf)
## Sensitivity = 0.89
## Specificity = 0.97
## Positive Predictive Value = 0.96
## Negative Predictive Value = 0.92
## Accuracy = 0.94
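
The performance() helper does not report the F1 score; a minimal sketch (not in the original report) that computes it from the same confusion-matrix layout (rows = actual, columns = predicted):

# F1 = harmonic mean of precision and recall
f1_score <- function(table){
  tp <- table[2,2]; fp <- table[1,2]; fn <- table[2,1]
  precision <- tp/(tp+fp)
  recall <- tp/(tp+fn)
  2*precision*recall/(precision+recall)
}
f1_score(forest.perf)  # e.g. for the random forest predictions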

6. Recommendation

Summarizing the evaluation results on the test set:

Model                 Sensitivity  Specificity  Accuracy
Logistic Regression   0.90         0.96         0.94
Conditional Tree      0.90         0.93         0.92
Random Forest         0.90         0.98         0.95
SVM                   0.89         0.97         0.94

The random forest achieves the highest accuracy and specificity while matching the best sensitivity, so we recommend it for this dataset. Because a false negative (a missed malignant case) is the most costly error in this setting, future work should focus on raising sensitivity, for example by tuning the classification threshold or the class weights.