This report describes the prediction of breast cancer diagnosis using machine learning algorithms. We investigated … The dataset used in this report is the Breast Cancer Wisconsin dataset hosted on Kaggle.
The dataset can be downloaded here.
Report Outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
The dataset is downloaded from Kaggle and saved in the data folder. We use the read.csv() function to read the dataset into the bcw_df data frame.
bcw_df<-read.csv("data/data.csv")
To see the number of rows and columns, we use the dim() function. The dataset has 569 rows and 33 columns.
dim(bcw_df)
## [1] 569 33
To find out the column names and types, we use the str() function.
str(bcw_df)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
From the result above, we know the following:
1. The first column is id. It is a unique identifier and unnecessary for prediction, so it should be removed.
2. The second column is diagnosis. This is the class variable. Its type is currently chr (character), and it should be converted to a factor.
3. The last column is X. All its values are NA, so it should be removed.
# remove unnecessary columns
bcw_df$id<-NULL
bcw_df$X<-NULL
# convert the target variable to a factor
bcw_df$diagnosis <- as.factor(bcw_df$diagnosis)
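The cleanup above can also be driven by the data rather than by reading the str() output. The following is a minimal sketch on a made-up toy data frame (toy_df and its values are illustrative stand-ins, not the Wisconsin data), assuming only base R. It finds the columns in which every value is NA, drops them together with id, and converts the target to a factor.

```r
# Toy data frame standing in for bcw_df (values are made up for illustration)
toy_df <- data.frame(
  id = c(1L, 2L, 3L),
  diagnosis = c("M", "B", "B"),
  radius_mean = c(18.0, 11.4, 12.5),
  X = c(NA, NA, NA)
)

# Names of the columns in which every value is NA
all_na_cols <- names(toy_df)[colSums(is.na(toy_df)) == nrow(toy_df)]
print(all_na_cols)

# Drop the all-NA columns and the identifier, then convert the target
drop_cols <- c(all_na_cols, "id")
clean_df <- toy_df[, !(names(toy_df) %in% drop_cols), drop = FALSE]
clean_df$diagnosis <- as.factor(clean_df$diagnosis)
str(clean_df)
```

This form avoids hard-coding the name X, which is useful if the empty column is named differently in another export of the file.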
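Analysis of one variable. Examples: box plot, histogram, pie chart.
Analysis of a single variable. The plots below show the number of benign (B) and malignant (M) cases in the diagnosis column, and the distribution of radius_mean as a box plot and a histogram.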
# multiple graphs in one figure
library(ggplot2)
p1 <- ggplot(data=bcw_df, aes(x=diagnosis)) + geom_bar()
p2 <- ggplot(data=bcw_df, aes(y=radius_mean)) + geom_boxplot()+
labs(title="Breast Cancer Wisconsin Data", y="Radius Mean")
p3 <- ggplot(data=bcw_df, aes(x=radius_mean)) + geom_histogram()+
labs(title="Breast Cancer Wisconsin Data", x="Radius Mean")
library(gridExtra)
grid.arrange(p1,p2,p3, ncol =3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Analysis of two variables. Examples: pair plot, scatter plot, box plot with points.
Analysis of two variables. Distribution of the radius mean variable for each diagnosis, shown as a box plot with jittered points and as a density plot.
ggplot(bcw_df, aes(x=diagnosis,y=radius_mean)) + geom_boxplot()+
geom_jitter(alpha=0.3,
color ='blue',
width =0.2)+
labs(title="Breast Cancer Wisconsin Data",y="Radius Mean")
ggplot(data=bcw_df, aes(x=radius_mean, fill=diagnosis))+
geom_density(alpha=0.3)
Observations based on the radius mean and texture mean variables. Each point is a single observation. The color and shape of each point are based on the diagnosis (benign or malignant).
ggplot(bcw_df, aes(x=radius_mean, y=texture_mean,shape=diagnosis,color=diagnosis)) + geom_point()
In general, benign cases have lower radius mean and texture mean measurements than malignant cases. However, these two variables are not enough to separate the classes.
Analysis of three or more variables. Example: correlation coefficient.
There are three types of measurement: mean, standard error (se), and worst (mean of the three largest values). Each measurement type has 10 variables, so the total is 30 variables. We want to compute and visualize the correlation coefficients within each measurement type.
Visualize Pearson’s correlation coefficients for the *_mean variables.
library(corrgram)
corrgram(bcw_df[2:11], order = TRUE,
upper.panel =panel.pie)
Visualize Pearson’s correlation coefficients for the *_se variables.
corrgram(bcw_df[12:21], order = TRUE,
upper.panel =panel.pie)
Visualize Pearson’s correlation coefficients for the *_worst variables.
corrgram(bcw_df[22:31], order = TRUE,
upper.panel =panel.pie)
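The corrgram plots show the correlations visually. The same Pearson coefficients can also be listed as numbers, which makes it easier to pick out strongly correlated pairs. The following is a minimal sketch using base R's cor() on synthetic data. The data frame sim_df, its values, and the 0.9 cut-off are assumptions made for illustration and are not taken from this report.

```r
# Synthetic measurements: perimeter and area are derived from radius,
# texture is independent (made-up data, not the Wisconsin dataset)
set.seed(1)
radius <- runif(200, min = 7, max = 28)
sim_df <- data.frame(
  radius = radius,
  perimeter = 2 * pi * radius + rnorm(200, sd = 1),
  area = pi * radius^2 + rnorm(200, sd = 10),
  texture = rnorm(200, mean = 19, sd = 4)
)

# Pearson correlation matrix of all numeric columns
cor_mat <- cor(sim_df, method = "pearson")

# Keep each pair once (upper triangle) and list those above the cut-off
threshold <- 0.9  # hypothetical cut-off
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > threshold, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = round(cor_mat[idx], 3)
)
print(high_pairs)
```

On the real data, the same steps could be applied to bcw_df[2:11], bcw_df[12:21], and bcw_df[22:31] in turn.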
From the correlation coefficients, we can see that area, radius, and perimeter are collinear. So, we need to remove two of them: area and perimeter.
We can also see that compactness, concavity, and concave points are collinear. So, we need to remove two of them: compactness and concavity.
## 3. Data Preparation
Remove the *_worst variables. Based on discussion with a domain expert, all the variables ending in worst should be removed.
bcw_df2<-bcw_df[1:21]
Remove the area, perimeter, compactness, and concavity variables (both *_mean and *_se).
bcw_df2$area_mean<-NULL
bcw_df2$perimeter_mean<-NULL
bcw_df2$compactness_mean<-NULL
bcw_df2$concavity_mean<-NULL
bcw_df2$area_se<-NULL
bcw_df2$perimeter_se<-NULL
bcw_df2$compactness_se<-NULL
bcw_df2$concavity_se<-NULL
dim(bcw_df2)
## [1] 569 13
We call set.seed() for a reproducible result and split the data into training and test sets with a train:test ratio of 70:30.
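A minimal sketch of such a 70:30 split, assuming base R's sample(). The seed used for the split is not stated here, so 2021 is an assumption borrowed from the later model-fitting calls. The helper name split_train_test and the stand-in data frame demo_df are hypothetical.

```r
# Hypothetical helper: split a data frame into training and test sets
split_train_test <- function(df, train_ratio = 0.7, seed = 2021) {
  set.seed(seed)  # makes the random sample reproducible
  n_train <- floor(train_ratio * nrow(df))
  train_idx <- sample(nrow(df), n_train)  # row numbers drawn without replacement
  list(train = df[train_idx, , drop = FALSE],
       test = df[-train_idx, , drop = FALSE])
}

# Stand-in with the same number of rows as bcw_df2 (values are made up)
demo_df <- data.frame(
  diagnosis = factor(sample(c("B", "M"), 569, replace = TRUE)),
  radius_mean = runif(569, min = 7, max = 28)
)

parts <- split_train_test(demo_df)
demo_train <- parts$train
demo_test <- parts$test
dim(demo_train)
dim(demo_test)
```

With 569 rows this gives 398 training rows and 171 test rows, which agrees with the 397 null-deviance degrees of freedom and the 171 test cases reported in the later sections. Applied to bcw_df2, the two parts would play the role of train_df and test_df.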
The modeling part fits four classifiers on the training set (train_df): logistic regression, a conditional inference tree, a random forest, and a support vector machine.
fit.logit <-glm(formula=diagnosis ~.,
data= train_df,
family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.logit)
##
## Call:
## glm(formula = diagnosis ~ ., family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.29130 -0.06276 -0.01079 0.00266 2.41194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.2941 10.8434 -3.532 0.000413 ***
## radius_mean 0.9770 0.3246 3.009 0.002618 **
## texture_mean 0.5801 0.1110 5.226 1.73e-07 ***
## smoothness_mean -0.2969 51.6663 -0.006 0.995414
## concave.points_mean 134.7234 38.6187 3.489 0.000486 ***
## symmetry_mean 42.5947 20.3845 2.090 0.036657 *
## fractal_dimension_mean 22.7464 105.3806 0.216 0.829105
## radius_se 5.0455 2.9014 1.739 0.082037 .
## texture_se -1.5634 0.9470 -1.651 0.098737 .
## smoothness_se 72.5956 159.1434 0.456 0.648272
## concave.points_se -160.9316 122.3363 -1.315 0.188347
## symmetry_se -27.4352 60.9517 -0.450 0.652628
## fractal_dimension_se -351.4736 288.8728 -1.217 0.223716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.229 on 397 degrees of freedom
## Residual deviance: 75.269 on 385 degrees of freedom
## AIC: 101.27
##
## Number of Fisher Scoring iterations: 9
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
fit.ctree <- ctree(diagnosis~., data=train_df)
plot(fit.ctree, main="Conditional Inference Tree")
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(2021)
fit.forest <- randomForest(diagnosis~., data=train_df,
na.action=na.roughfix,
importance=TRUE)
fit.forest
##
## Call:
## randomForest(formula = diagnosis ~ ., data = train_df, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 5.28%
## Confusion matrix:
## B M class.error
## B 249 9 0.03488372
## M 12 128 0.08571429
library(e1071)
set.seed(2021)
fit.svm <- svm(diagnosis~., data=train_df)
fit.svm
##
## Call:
## svm(formula = diagnosis ~ ., data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 119
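We compute sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, and accuracy from each confusion matrix. Malignant (M) is treated as the positive class.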
performance <- function(table, n=2){
  # table: 2x2 confusion matrix, rows = actual, columns = predicted;
  # the second level (malignant) is the positive class
  tn = table[1,1]
  fp = table[1,2]
  fn = table[2,1]
  tp = table[2,2]
  sensitivity = tp/(tp+fn) # recall
  specificity = tn/(tn+fp)
  ppp = tp/(tp+fp) # precision
  npp = tn/(tn+fn)
  hitrate = (tp+tn)/(tp+tn+fp+fn) # accuracy
  result <- paste("Sensitivity = ", round(sensitivity, n) ,
                  "\nSpecificity = ", round(specificity, n),
                  "\nPositive Predictive Value = ", round(ppp, n),
                  "\nNegative Predictive Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, n), "\n", sep="")
  cat(result)
}
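The performance() function does not print an F1 score. It can be derived from the same four counts as the harmonic mean of precision and recall. The following is a minimal sketch: f1_score is a hypothetical helper name, and the example counts are copied from the logistic regression confusion matrix shown in the evaluation results of this report.

```r
# Hypothetical helper: F1 score from a 2x2 table laid out like the ones in
# this report (rows = actual, columns = predicted, second level = positive)
f1_score <- function(table) {
  fp <- table[1, 2]
  fn <- table[2, 1]
  tp <- table[2, 2]
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)  # harmonic mean
}

# Counts copied from the logistic regression confusion matrix in this report
# (matrix() fills column by column: TN, FN, FP, TP)
logit_counts <- matrix(c(95, 7, 4, 65), nrow = 2,
                       dimnames = list(Actual = c("B", "M"),
                                       Predicted = c("benign", "malignant")))
round(f1_score(logit_counts), 2)
```

The same function accepts the table objects built with table() later in this section, for example f1_score(logit.perf).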
prob <- predict(fit.logit,test_df, type="response")
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
labels=c("benign", "malignant"))
logit.perf <- table(test_df$diagnosis, logit.pred,
dnn=c("Actual","Predicted"))
logit.perf
## Predicted
## Actual benign malignant
## B 95 4
## M 7 65
performance(logit.perf)
## Sensitivity = 0.9
## Specificity = 0.96
## Positive Predictive Value = 0.94
## Negative Predictive Value = 0.93
## Accuracy = 0.94
ctree.pred <- predict(fit.ctree, test_df, type = "response")
ctree.perf <- table(test_df$diagnosis, ctree.pred,
dnn=c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual B M
## B 92 7
## M 7 65
performance(ctree.perf)
## Sensitivity = 0.9
## Specificity = 0.93
## Positive Predictive Value = 0.9
## Negative Predictive Value = 0.93
## Accuracy = 0.92
forest.pred <- predict(fit.forest, test_df, type = "response")
forest.perf <- table(test_df$diagnosis, forest.pred,
dnn=c("Actual","Predicted"))
forest.perf
## Predicted
## Actual B M
## B 97 2
## M 7 65
performance(forest.perf)
## Sensitivity = 0.9
## Specificity = 0.98
## Positive Predictive Value = 0.97
## Negative Predictive Value = 0.93
## Accuracy = 0.95
svm.pred <- predict(fit.svm, test_df, type = "response")
svm.perf <- table(test_df$diagnosis, svm.pred,
dnn=c("Actual","Predicted"))
svm.perf
## Predicted
## Actual B M
## B 96 3
## M 8 64
performance(svm.perf)
## Sensitivity = 0.89
## Specificity = 0.97
## Positive Predictive Value = 0.96
## Negative Predictive Value = 0.92
## Accuracy = 0.94
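To view the four models side by side, the metrics can be collected into one table. The following is a minimal sketch that re-enters the confusion matrix counts printed above by hand. The object names counts, metrics, and comparison are illustrative. In a live session the same calculation could start from logit.perf, ctree.perf, forest.perf, and svm.perf instead.

```r
# Confusion matrix counts copied from the output above, written column by
# column for a table with rows = actual and columns = predicted:
# TN, FN, FP, TP
counts <- list(
  logistic = c(95, 7, 4, 65),
  ctree    = c(92, 7, 7, 65),
  forest   = c(97, 7, 2, 65),
  svm      = c(96, 8, 3, 64)
)

# One row of metrics per model
metrics <- t(sapply(counts, function(x) {
  m <- matrix(x, nrow = 2)
  tn <- m[1, 1]; fp <- m[1, 2]; fn <- m[2, 1]; tp <- m[2, 2]
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy = (tp + tn) / sum(m))
}))

comparison <- data.frame(model = rownames(metrics), round(metrics, 2),
                         row.names = NULL)
print(comparison)
```

The rounded values reproduce the figures printed by performance() for each model.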
This is the Recommendation part of the data analysis.