Data Set Information:
Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsy resulting from mammogram interpretation leads to approximately 70% of biopsies being performed on lesions that turn out to be benign. To reduce this high number of unnecessary breast biopsies, several computer-aided diagnosis (CAD) systems have been proposed in recent years. These systems help physicians decide whether to perform a breast biopsy on a suspicious lesion seen in a mammogram or to schedule a short-term follow-up examination instead. This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes, together with the ground truth (the severity field), for 516 benign and 445 malignant masses identified on full-field digital mammograms collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. Each instance has an associated BI-RADS assessment ranging from 1 (definitely benign) to 5 (highly suggestive of malignancy), assigned in a double-review process by physicians. By assuming that all cases with a BI-RADS assessment greater than or equal to a given cutoff (varying from 1 to 5) are malignant and the remaining cases benign, a sensitivity and an associated specificity can be calculated for each cutoff. These give an indication of how well a CAD system performs compared to the radiologists.
Class Distribution: benign: 516; malignant: 445
Attribute Information:
6 Attributes in total (1 goal field, 1 non-predictive, 4 predictive attributes)
# Read the raw data; missing values are coded as "?" in the source file
data1 <- read.csv("mammographic.data", header = FALSE, sep = ",",
                  skip = 10, na.strings = "?")
# Rename the columns
colnames(data1) <- c("BIRADS", "Age", "Shape", "Margin", "Density", "Severity")
summary(data1)
## BIRADS Age Shape Margin
## Min. : 0.000 Min. :18.00 Min. :1.000 Min. :1.000
## 1st Qu.: 4.000 1st Qu.:45.00 1st Qu.:2.000 1st Qu.:1.000
## Median : 4.000 Median :57.00 Median :3.000 Median :3.000
## Mean : 4.346 Mean :55.48 Mean :2.731 Mean :2.788
## 3rd Qu.: 5.000 3rd Qu.:66.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :55.000 Max. :96.00 Max. :4.000 Max. :5.000
## NA's :2 NA's :5 NA's :29 NA's :45
## Density Severity
## Min. :1.000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:0.0000
## Median :3.000 Median :0.0000
## Mean :2.912 Mean :0.4616
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :4.000 Max. :1.0000
## NA's :74
str(data1)
## 'data.frame': 951 obs. of 6 variables:
## $ BIRADS : int 5 3 4 4 4 4 3 4 4 4 ...
## $ Age : int 76 42 64 36 60 54 52 59 54 40 ...
## $ Shape : int 1 2 1 3 2 1 3 2 1 1 ...
## $ Margin : int 4 1 NA 1 1 1 4 1 1 NA ...
## $ Density : int 3 3 3 2 2 3 3 3 3 NA ...
## $ Severity: int 1 1 0 0 0 0 0 1 1 0 ...
Here we observe the data types of the features and convert them to factors as needed. The summary above also shows a number of NA values, which we will handle during further pre-processing.
Using the dplyr library, we then check the count of each level in the BIRADS column, since some invalid factor levels are present.
# Convert the categorical features to factors; Age stays numeric
data1$BIRADS   <- as.factor(data1$BIRADS)
data1$Age      <- as.numeric(data1$Age)
data1$Shape    <- as.factor(data1$Shape)
data1$Margin   <- as.factor(data1$Margin)
data1$Density  <- as.factor(data1$Density)
data1$Severity <- as.factor(data1$Severity)
levels(data1$BIRADS)
## [1] "0" "2" "3" "4" "5" "6" "55"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data1 %>%
  group_by(BIRADS) %>%
  summarise(no_rows = n())
## # A tibble: 8 x 2
## BIRADS no_rows
## <fct> <int>
## 1 0 5
## 2 2 14
## 3 3 36
## 4 4 543
## 5 5 339
## 6 6 11
## 7 55 1
## 8 <NA> 2
Here we remap the observed BIRADS levels onto the five assessment values (1-5) given in the data description: 0 is folded into 1, the levels 2-6 are shifted down to 1-5, and the out-of-range entry 55 is clamped to 5.
levels(data1$BIRADS) <- list("1"="0","1"="2","2"="3","3"="4","4"="5","5"="6","5"="55")
levels(data1$BIRADS)
## [1] "1" "2" "3" "4" "5"
Now we handle the NA values in the dataset using the dplyr library: we impute the mode for features with a factor data type and the median for numeric features.
# Mode of a vector: the most frequent value
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Impute medians for numeric columns and modes for factor columns
data1 <- data1 %>%
  mutate_if(is.numeric, ~ replace(., is.na(.), median(., na.rm = TRUE))) %>%
  mutate_if(is.factor,  ~ replace(., is.na(.), Mode(na.omit(.))))
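The data set description suggests a simple radiologist baseline: call every mass with a BI-RADS assessment at or above a cutoff t malignant and compute the resulting sensitivity/specificity pair. A minimal sketch on the cleaned data (note that BIRADS here holds the recoded 1-5 levels from above, not the raw assessments):
# For each cutoff t, predict "malignant" when BIRADS >= t and compare
# against the ground truth (Severity == 1)
sapply(1:5, function(t) {
  pred  <- as.numeric(as.character(data1$BIRADS)) >= t
  truth <- data1$Severity == "1"
  c(cutoff      = t,
    sensitivity = sum(pred & truth) / sum(truth),
    specificity = sum(!pred & !truth) / sum(!truth))
})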
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(data1, aes(colour = Severity, alpha = 0.4))
##
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This pairs plot gives a brief overview of the distribution of each feature and of the pairwise relationships between them, coloured by Severity.
library(ggplot2)
# Severity counts by Shape
ggplot(data1, aes(Shape)) +
  geom_bar(aes(fill = Severity))
# Severity counts by BIRADS
ggplot(data1, aes(BIRADS)) +
  geom_bar(aes(fill = Severity))
# Severity counts by Density
ggplot(data1, aes(Density)) +
  geom_bar(aes(fill = Severity))
# Severity counts by Margin
ggplot(data1, aes(Margin)) +
  geom_bar(aes(fill = Severity))
Boxplots let us spot any outliers present in the dataset and visualise the statistical summary of each feature.
# Boxplot of each column
data1 %>% boxplot()
ggplot(data = data1, aes(x = BIRADS, y = Age)) +
  geom_jitter(alpha = 0.1, color = "tomato") +
  geom_boxplot(alpha = 0)
ggplot(data1, aes(x = BIRADS, y = Age, group = Shape)) +
  geom_jitter() +
  facet_wrap(~ Shape)
Now we split the dataset into training and test sets using the caTools library, so we can train our classification models.
library(caTools)
set.seed(123)
split <- sample.split(Y = data1$Severity, SplitRatio = 0.75)
train_set <- subset(x = data1, split == TRUE)
test_set  <- subset(x = data1, split == FALSE)
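sample.split stratifies on the labels passed as Y, so the benign/malignant ratio should be roughly preserved in both subsets; a quick sanity check:
# Class proportions should be close to the full data set in both splits
prop.table(table(train_set$Severity))
prop.table(table(test_set$Severity))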
# Logistic regression: Severity as response, all other columns as predictors
classifier = glm(formula = Severity ~ .,
                 family = binomial(),
                 data = train_set)
summary(classifier)
##
## Call:
## glm(formula = Severity ~ ., family = binomial(), data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8807 -0.4779 -0.2240 0.4133 2.6021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.939056 1.268262 -3.894 9.85e-05 ***
## BIRADS2 -1.782744 1.316003 -1.355 0.175524
## BIRADS3 0.417191 0.773070 0.540 0.589435
## BIRADS4 2.883864 0.794289 3.631 0.000283 ***
## BIRADS5 2.878831 1.404942 2.049 0.040455 *
## Age 0.045892 0.009666 4.748 2.06e-06 ***
## Shape2 -0.365179 0.377346 -0.968 0.333167
## Shape3 0.447728 0.446976 1.002 0.316497
## Shape4 1.389163 0.390200 3.560 0.000371 ***
## Margin2 1.285492 0.717781 1.791 0.073305 .
## Margin3 0.612313 0.417561 1.466 0.142539
## Margin4 0.629656 0.352029 1.789 0.073672 .
## Margin5 1.014750 0.451098 2.250 0.024480 *
## Density2 -0.399715 0.927479 -0.431 0.666491
## Density3 -0.070334 0.818579 -0.086 0.931529
## Density4 -2.074156 1.429862 -1.451 0.146892
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 984.18 on 712 degrees of freedom
## Residual deviance: 501.95 on 697 degrees of freedom
## AIC: 533.95
##
## Number of Fisher Scoring iterations: 6
Predicting the result on the test_set and generating a confusion matrix to calculate the accuracy of our logistic regression model.
# Predict class probabilities on the test set
y_pred = predict(object = classifier,
                 type = 'response',
                 newdata = test_set)
# Threshold the probabilities at 0.5 to get class labels
y_pred = ifelse(y_pred > 0.5, 1, 0)
library(caret)
## Loading required package: lattice
confusionMatrix(as.factor(y_pred), test_set$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 105 26
## 1 23 84
##
## Accuracy : 0.7941
## 95% CI : (0.7371, 0.8436)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5851
##
## Mcnemar's Test P-Value : 0.7751
##
## Sensitivity : 0.8203
## Specificity : 0.7636
## Pos Pred Value : 0.8015
## Neg Pred Value : 0.7850
## Prevalence : 0.5378
## Detection Rate : 0.4412
## Detection Prevalence : 0.5504
## Balanced Accuracy : 0.7920
##
## 'Positive' Class : 0
##
library(rpart)
# Fit a classification tree on all predictors
fit <- rpart(formula = Severity ~ ., data = train_set, method = "class")
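rpart grows the tree with its default control settings. As an aside (not part of the original analysis), a common way to guard against overfitting is to inspect the cross-validated error and prune at the best complexity parameter:
# Cross-validated error for each complexity parameter value
printcp(fit)
# Prune at the cp with the lowest cross-validated error (xerror)
fit_pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])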
We can generate a better visualisation of our decision tree model using the rpart.plot library.
library(rpart.plot)
rpart.plot(fit)
Predicting the result on the test_set and generating a confusion matrix to calculate the accuracy of our decision tree model.
predict_unseen = predict(object = fit, newdata = test_set, type = "class")
# Create the confusion matrix
confusionMatrix(predict_unseen, test_set$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107 27
## 1 21 83
##
## Accuracy : 0.7983
## 95% CI : (0.7417, 0.8474)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5928
##
## Mcnemar's Test P-Value : 0.4705
##
## Sensitivity : 0.8359
## Specificity : 0.7545
## Pos Pred Value : 0.7985
## Neg Pred Value : 0.7981
## Prevalence : 0.5378
## Detection Rate : 0.4496
## Detection Prevalence : 0.5630
## Balanced Accuracy : 0.7952
##
## 'Positive' Class : 0
##
library(e1071)
# Naive Bayes: column 6 (Severity) is the label, the rest are predictors
classifier = naiveBayes(x = train_set[-6], y = train_set$Severity)
summary(classifier)
## Length Class Mode
## apriori 2 table numeric
## tables 5 -none- list
## levels 2 -none- character
## isnumeric 5 -none- logical
## call 3 -none- call
Predicting the result on the test_set and generating a confusion matrix to calculate the accuracy of our Naive Bayes model.
# Predict class labels on the test set
y_pred = predict(object = classifier, newdata = test_set)
confusionMatrix(as.factor(y_pred), test_set$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 102 23
## 1 26 87
##
## Accuracy : 0.7941
## 95% CI : (0.7371, 0.8436)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5867
##
## Mcnemar's Test P-Value : 0.7751
##
## Sensitivity : 0.7969
## Specificity : 0.7909
## Pos Pred Value : 0.8160
## Neg Pred Value : 0.7699
## Prevalence : 0.5378
## Detection Rate : 0.4286
## Detection Prevalence : 0.5252
## Balanced Accuracy : 0.7939
##
## 'Positive' Class : 0
##
Here we observe that the decision tree (79.8% accuracy) performs marginally better than logistic regression and Naive Bayes (both 79.4%); the three models are effectively on par.
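Accuracy at a fixed 0.5 cutoff is a coarse yardstick for a screening problem, where a missed malignancy costs far more than an extra biopsy. A threshold-free comparison such as ROC/AUC would be more informative; below is a sketch for the logistic regression model using the pROC package (assuming it is installed; logit_model is a hypothetical name, introduced because classifier was reused for the Naive Bayes model above):
library(pROC)
# Keep the logistic regression under its own name to avoid the name clash
logit_model <- glm(Severity ~ ., family = binomial(), data = train_set)
# Predicted probabilities on the test set, then the ROC curve and AUC
prob <- predict(logit_model, newdata = test_set, type = "response")
roc_obj <- roc(response = test_set$Severity, predictor = prob)
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # sensitivity vs. specificity across all cutoffs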