Mammographic Mass Data Set

Data Set Information:

Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsy resulting from mammogram interpretation leads to approximately 70% unnecessary biopsies with benign outcomes. To reduce the high number of unnecessary breast biopsies, several computer-aided diagnosis (CAD) systems have been proposed in the last years.These systems help physicians in their decision to perform a breast biopsy on a suspicious lesion seen in a mammogram or to perform a short term follow-up examination instead. This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient’s age. It contains a BI-RADS assessment, the patient’s age and three BI-RADS attributes together with the ground truth (the severity field) for 516 benign and 445 malignant masses that have been identified on full field digital mammograms collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. Each instance has an associated BI-RADS assessment ranging from 1 (definitely benign) to 5 (highly suggestive of malignancy) assigned in a double-review process by physicians. Assuming that all cases with BI-RADS assessments greater or equal a given value (varying from 1 to 5), are malignant and the other cases benign, sensitivities and associated specificities can be calculated. These can be an indication of how well a CAD system performs compared to the radiologists.

Class Distribution: benign: 516; malignant: 445

Attribute Information:

6 Attributes in total (1 goal field, 1 non-predictive, 4 predictive attributes)

  1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
  2. Age: patient’s age in years (integer)
  3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
  4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
  5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
  6. Severity: benign=0 or malignant=1 (binominal, goal field!)

Loading the Dataset

data1<- read.csv("mammographic.data", header = FALSE, sep= "," , skip = 10, na.strings = "?")
#renaming the column names
colnames(data1)<- c("BIRADS", "Age", "Shape", "Margin", "Density", "Severity")

Displaying the summary and structure of the Dataset

summary(data1)
##      BIRADS            Age            Shape           Margin     
##  Min.   : 0.000   Min.   :18.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.:45.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 4.000   Median :57.00   Median :3.000   Median :3.000  
##  Mean   : 4.346   Mean   :55.48   Mean   :2.731   Mean   :2.788  
##  3rd Qu.: 5.000   3rd Qu.:66.00   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :55.000   Max.   :96.00   Max.   :4.000   Max.   :5.000  
##  NA's   :2        NA's   :5       NA's   :29      NA's   :45     
##     Density         Severity     
##  Min.   :1.000   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.:0.0000  
##  Median :3.000   Median :0.0000  
##  Mean   :2.912   Mean   :0.4616  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :1.0000  
##  NA's   :74
str(data1)
## 'data.frame':    951 obs. of  6 variables:
##  $ BIRADS  : int  5 3 4 4 4 4 3 4 4 4 ...
##  $ Age     : int  76 42 64 36 60 54 52 59 54 40 ...
##  $ Shape   : int  1 2 1 3 2 1 3 2 1 1 ...
##  $ Margin  : int  4 1 NA 1 1 1 4 1 1 NA ...
##  $ Density : int  3 3 3 2 2 3 3 3 3 NA ...
##  $ Severity: int  1 1 0 0 0 0 0 1 1 0 ...

Here we observe the data type of the features and will convert them into factors as needed and in the summary as we observed there are few NA’s values in the which we need to be removed in further pre-processing.

using the Dplyr library checking the count of each level in the BIRADS column as we have irrelevent factors defined.

data1$BIRADS<-as.factor(as.numeric(data1$BIRADS))
data1$Age<-as.numeric(data1$Age)
data1$Shape<- as.factor(as.numeric(data1$Shape))
data1$Margin<- as.factor(as.numeric(data1$Margin))
data1$Density <-  as.factor(as.numeric(data1$Density))
data1$Severity<-as.factor(data1$Severity)
levels(data1$BIRADS)
## [1] "0"  "2"  "3"  "4"  "5"  "6"  "55"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data1 %>% 
  group_by(BIRADS) %>%
  summarise(no_rows = length(BIRADS))
## # A tibble: 8 x 2
##   BIRADS no_rows
##   <fct>    <int>
## 1 0            5
## 2 2           14
## 3 3           36
## 4 4          543
## 5 5          339
## 6 6           11
## 7 55           1
## 8 <NA>         2

Here we will make the levels of the BIRADS as per the data description.

levels(data1$BIRADS) <- list("1"="0","1"="2","2"="3","3"="4","4"="5","5"="6","5"="55")
levels(data1$BIRADS)
## [1] "1" "2" "3" "4" "5"

Now we will remove the NA values present in the dataset using dplyr library. we will impute the MODE value for the the features with “factor” data type and with MEDIAN for the “numerical” data type features.

#defining mode function
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
data1=data1 %>% mutate_if(is.numeric, funs(replace(.,is.na(.), median(., na.rm = TRUE)))) %>%
  mutate_if(is.factor, funs(replace(.,is.na(.), Mode(na.omit(.)))))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## i Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

Visualizations for Exploratory Data Analysis

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(data1, aes(colour = Severity, alpha = 0.4))
## 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## 
## 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## 
## 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## 
## 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## 
## 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These univariate plots give a brief description of the distribution of the features present in our dataset

library(ggplot2)
#proportion of severity in shape
ggplot(data1, aes(Shape)) + 
  geom_bar(aes(fill = Severity))

#proportion of severity in BIRADS
ggplot(data1, aes(BIRADS)) + 
  geom_bar(aes(fill = Severity))

#proportion of severity in density
ggplot(data1, aes(Density)) + 
  geom_bar(aes(fill = Severity))

#proportion of severity in margin
ggplot(data1, aes(Margin)) + 
  geom_bar(aes(fill = Severity))

Using Boxplot we can get the observation about the outliers if any present in the dataset and visualize the statistical summary of each of the feature

#boxplot
data1%>%boxplot()

ggplot(data = data1, aes(x = BIRADS, y = Age)) +
  geom_jitter(alpha = 0.1, color = "tomato") +
  geom_boxplot(alpha = 0)

ggplot(data1, aes(x = BIRADS, y = Age, group = Shape)) +
  geom_jitter() +
  facet_wrap(~ Shape)

Now we will split the dataset into training and testing dataset using the caTools library for training our classification model.

library(caTools)
set.seed(123)
split   =   sample.split(Y= data1$Severity,SplitRatio   =   0.75)
train_set=  subset(x    =data1,split    ==  TRUE)
test_set=   subset(x    =   data1,split ==  FALSE)

LOGISTIC REGRESSION MODEL

classifier  =   glm(formula =   Severity~.,
                 family =   binomial(),
                 data   =   train_set)
summary(classifier)
## 
## Call:
## glm(formula = Severity ~ ., family = binomial(), data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8807  -0.4779  -0.2240   0.4133   2.6021  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.939056   1.268262  -3.894 9.85e-05 ***
## BIRADS2     -1.782744   1.316003  -1.355 0.175524    
## BIRADS3      0.417191   0.773070   0.540 0.589435    
## BIRADS4      2.883864   0.794289   3.631 0.000283 ***
## BIRADS5      2.878831   1.404942   2.049 0.040455 *  
## Age          0.045892   0.009666   4.748 2.06e-06 ***
## Shape2      -0.365179   0.377346  -0.968 0.333167    
## Shape3       0.447728   0.446976   1.002 0.316497    
## Shape4       1.389163   0.390200   3.560 0.000371 ***
## Margin2      1.285492   0.717781   1.791 0.073305 .  
## Margin3      0.612313   0.417561   1.466 0.142539    
## Margin4      0.629656   0.352029   1.789 0.073672 .  
## Margin5      1.014750   0.451098   2.250 0.024480 *  
## Density2    -0.399715   0.927479  -0.431 0.666491    
## Density3    -0.070334   0.818579  -0.086 0.931529    
## Density4    -2.074156   1.429862  -1.451 0.146892    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 984.18  on 712  degrees of freedom
## Residual deviance: 501.95  on 697  degrees of freedom
## AIC: 533.95
## 
## Number of Fisher Scoring iterations: 6

Predicting the result on the test_set and genrating confusionMatrix to calculate the Accuracy of our logistic regression model

#Predicting
y_pred  =   predict(object  =   classifier,
                 type   =   'response',
                 newdata    =   test_set)
#y_pred
y_pred  =ifelse(y_pred>0.5,1,0)
library(caret)
## Loading required package: lattice
confusionMatrix(as.factor(y_pred),test_set$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 105  26
##          1  23  84
##                                           
##                Accuracy : 0.7941          
##                  95% CI : (0.7371, 0.8436)
##     No Information Rate : 0.5378          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5851          
##                                           
##  Mcnemar's Test P-Value : 0.7751          
##                                           
##             Sensitivity : 0.8203          
##             Specificity : 0.7636          
##          Pos Pred Value : 0.8015          
##          Neg Pred Value : 0.7850          
##              Prevalence : 0.5378          
##          Detection Rate : 0.4412          
##    Detection Prevalence : 0.5504          
##       Balanced Accuracy : 0.7920          
##                                           
##        'Positive' Class : 0               
## 

DECISION TREE

library(rpart)
fit<-rpart(formula=Severity~.,data = train_set,method = "class")

We can generate better visualization for our Decision tree model using the rpart.plot library

library(rpart.plot)
rpart.plot(fit)

predict_unseen=predict(object = fit,newdata = test_set,type = "class")
#create confusion matrix
confusionMatrix(predict_unseen,test_set$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  27
##          1  21  83
##                                           
##                Accuracy : 0.7983          
##                  95% CI : (0.7417, 0.8474)
##     No Information Rate : 0.5378          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5928          
##                                           
##  Mcnemar's Test P-Value : 0.4705          
##                                           
##             Sensitivity : 0.8359          
##             Specificity : 0.7545          
##          Pos Pred Value : 0.7985          
##          Neg Pred Value : 0.7981          
##              Prevalence : 0.5378          
##          Detection Rate : 0.4496          
##    Detection Prevalence : 0.5630          
##       Balanced Accuracy : 0.7952          
##                                           
##        'Positive' Class : 0               
## 

Predicting the result on the test_set and genrating confusionMatrix to calculate the Accuracy of our Decision tree model

predict_unseen=predict(object = fit,newdata = test_set,type = "class")
#create confusion matrix
confusionMatrix(predict_unseen,test_set$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  27
##          1  21  83
##                                           
##                Accuracy : 0.7983          
##                  95% CI : (0.7417, 0.8474)
##     No Information Rate : 0.5378          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5928          
##                                           
##  Mcnemar's Test P-Value : 0.4705          
##                                           
##             Sensitivity : 0.8359          
##             Specificity : 0.7545          
##          Pos Pred Value : 0.7985          
##          Neg Pred Value : 0.7981          
##              Prevalence : 0.5378          
##          Detection Rate : 0.4496          
##    Detection Prevalence : 0.5630          
##       Balanced Accuracy : 0.7952          
##                                           
##        'Positive' Class : 0               
## 

NAIVE BAYES

library(e1071)
classifier = naiveBayes(x= train_set[-6], y = train_set$Severity)
summary(classifier)
##           Length Class  Mode     
## apriori   2      table  numeric  
## tables    5      -none- list     
## levels    2      -none- character
## isnumeric 5      -none- logical  
## call      3      -none- call

Predicting the result on the test_set and genrating confusionMatrix to calculate the Accuracy of our Naive Bayes model

#predict
y_pred = predict(object = classifier, newdata = test_set)
confusionMatrix(as.factor(y_pred),test_set$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 102  23
##          1  26  87
##                                           
##                Accuracy : 0.7941          
##                  95% CI : (0.7371, 0.8436)
##     No Information Rate : 0.5378          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5867          
##                                           
##  Mcnemar's Test P-Value : 0.7751          
##                                           
##             Sensitivity : 0.7969          
##             Specificity : 0.7909          
##          Pos Pred Value : 0.8160          
##          Neg Pred Value : 0.7699          
##              Prevalence : 0.5378          
##          Detection Rate : 0.4286          
##    Detection Prevalence : 0.5252          
##       Balanced Accuracy : 0.7939          
##                                           
##        'Positive' Class : 0               
## 

Conclusion

Here we observe that model like Decision tree and Naive Bayes are performing with almost same accuracy which is slightly better than the Logistic Regression.