Part 1 - Introduction

Breast cancer is one of the most dreaded and deadly cancer diagnoses a woman can receive. For women in the U.S., breast cancer death rates are higher than those for any other cancer except lung cancer. Many institutions have dedicated years of research to improving the survival chances of breast cancer patients, and there has been measurable improvement in new incidence rates since 2000. Treatment advances, earlier detection through screening, and increased awareness are all key factors in surviving breast cancer, and the emergence of machine learning in medical research is an important step toward detecting and predicting malignant tumors.

It is estimated that over 252,710 US women will be diagnosed with breast cancer each year, and about 1 in 8 US women will develop invasive breast cancer over the course of her lifetime. The most advanced form, Stage 4 or metastatic breast cancer, occurs when cancer cells migrate from the breast to other parts of the body and trigger cancerous growth there; it is considered terminal because there is no cure. More than 40,000 US women a year die from metastatic breast cancer, a number that has not changed since 1970. Research enabling earlier detection of malignancy is therefore imperative to the survival of women diagnosed with breast cancer.

Tests such as MRI, mammography, ultrasound and biopsy are commonly used to diagnose breast cancer. Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, created a dataset from fine needle aspiration biopsies of patients with solid breast masses, using a computer vision approach known as “snakes” to compute values for ten characteristics of each cell nucleus, measuring size, shape and texture. The mean, standard error and extreme (worst) values of these features were computed, resulting in a total of 30 nuclear features for each sample.

Using this dataset we can examine the biopsy observations and investigate whether any variable, or combination of variables, predicts a malignant or benign diagnosis.

Part 2 - Data

Data collection:

The features in this dataset characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses. They describe characteristics of the cell nuclei present in the image. Dr. William H. Wolberg collected samples from 569 patients with solid breast masses and computed values for ten characteristics of each cell nucleus, measuring size, shape and texture, along with the mean, standard error and extreme (worst) value of each characteristic.

Cases:

Each case represents an individual sample, or observation, of tissue taken from a biopsy of a breast mass. There are 569 observations in the given dataset.

Variables:

Dependent Variable

The response variable is the diagnosis, a qualitative binary categorical variable with two levels: benign or malignant.

Independent Variables

There are 30 independent variables, all quantitative. They describe aspects of the tissue samples and include the mean, standard error and worst (most extreme) value of each of ten base measurements.

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

All feature values are recorded with four significant digits. There are no missing data.

The class distribution of the data is 357 benign and 212 malignant observations.
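
As a quick sanity check, both of these facts can be verified directly in R. A minimal sketch, assuming the data have been read into a data frame named bc_data_no_id (the name that appears in the model output later in this report) with the ID column removed:

# Check for missing values and the benign/malignant split.
sum(is.na(bc_data_no_id))        # expect 0: no missing values
table(bc_data_no_id$diagnosis)   # expect 357 benign and 212 malignant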

Type of study:

This study is an observational study of biopsied breast tissue masses. The samples were taken as a result of mass detection and not as part of an experimental study. They were collected during a medical procedure conducted to examine the breast mass tissue in an attempt to diagnose the mass as benign or malignant. The samples are independent of each other.

Scope of inference - generalizability:

The population of interest is people who have a detected breast mass and undergo a diagnostic procedure. The study is a cross-sectional study (also known as a cross-sectional analysis, transverse study, or prevalence study), a type of observational study that analyzes data from a population, or a representative subset, at a specific point in time—that is, cross-sectional data.

A cross-sectional study should be representative of the population if generalizations from the findings are to have any validity. The sample size should be sufficiently large to estimate the prevalence of the conditions of interest with adequate precision.

Non-response, or lack of voluntary subject participation, is a particular problem affecting cross-sectional studies and can bias the measures of outcome, especially when the characteristics of non-responders differ from those of responders.

Scope of inference - causality:

These data cannot be used to establish causal links between the variables of interest because of the type of study, but the findings can be used to describe associations between the features and the diagnosis within the population.

Part 3 - Exploratory Data Analysis

We perform relevant descriptive statistics, including summary statistics and visualization of the data, and consider what the exploratory data analysis suggests about the research question.

Boxplots of the 30 Variables vs. Diagnosis:

There are no missing values. For some variables, there are clear differences between the malignant and benign distributions. There are no obvious outliers.
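
The boxplots referenced above can be produced by reshaping the 30 features into long format and faceting by feature. A minimal sketch, assuming the data frame bc_data_no_id with a factor column named diagnosis:

library(tidyverse)

# Reshape all numeric features into long format: one row per (observation, feature).
bc_long <- bc_data_no_id %>%
  pivot_longer(cols = -diagnosis, names_to = "feature", values_to = "value")

# One boxplot per feature, split by diagnosis, with free y-scales because the
# features are measured on very different scales.
ggplot(bc_long, aes(x = diagnosis, y = value, fill = diagnosis)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y") +
  theme(legend.position = "none")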

Part 4 - Inference

Logistic regression is used for modeling when there is a categorical response variable with two levels, in other words when the dependent variable is binary. Like other regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. Logistic regression is a type of Generalized Linear Model.
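
In equation form, the model relates the log-odds (logit) of a malignant diagnosis to a linear combination of the predictors:

logit(pi) = log( pi / (1 - pi) ) = b0 + b1*x1 + b2*x2 + ... + bk*xk

where pi is the probability that observation i is malignant and x1, ..., xk are the predictor values for that observation.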

Check Conditions

Conditions for Logistic Regression:

  • Each predictor is linearly related to the logit(pi) if all other predictors are held constant. See below for linearity test.
  • Each outcome Yi is independent of the other outcomes. The samples are all taken independently of each other and therefore the data are independent.

Theoretical Inference -

Hypothesis Test and Confidence Intervals

Our null hypothesis is that none of the variables are good predictors of the diagnosis. The alternative hypothesis is that there is some variable, or combination of variables, that is a good predictor of the diagnosis. We can state the hypotheses as:

H0: No variables (individually or in combination) are good predictors of a benign or malignant diagnosis.

H1: There is a specific variable or combination of variables that is a good predictor of a benign or malignant diagnosis.

GLM for all Variables

Generalized linear models (GLMs) are an extension of linear models to model non-normal response variables. Logistic regression is for binary response variables, where there are two possible outcomes.
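
A fit of this kind can be produced with glm() and the binomial family. A minimal sketch, assuming the cleaned data frame bc_data_no_id with diagnosis as the response (the author's actual call, which uses a specific subset of the predictors, is reproduced in the output below):

# Binomial GLM (logistic regression); '.' uses every remaining column as a predictor.
# maxit is raised because the full model is close to separation and converges slowly.
bc_glm_all <- glm(diagnosis ~ ., data = bc_data_no_id,
                  family = binomial, control = list(maxit = 100))
summary(bc_glm_all)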

Plot of the residuals for the GLM Model with all Variables

## 
## Call:
## glm(formula = diagnosis ~ radius_mean + texture_mean + area_mean + 
##     compactness_mean + concavity_mean + concave_points_mean + 
##     perimeter_se + smoothness_se + compactness_se + concavity_se + 
##     concave_points_se + fractal_dimension_se + radius_worst + 
##     perimeter_worst + smoothness_worst + compactness_worst + 
##     symmetry_worst + fractal_dimension_worst, family = binomial, 
##     data = bc_data_no_id, control = list(maxit = 100))
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.888e-05  -2.110e-08  -2.110e-08   2.110e-08   2.927e-05  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)              4.628e+03  1.018e+07   0.000    1.000
## radius_mean             -4.178e+03  1.836e+06  -0.002    0.998
## texture_mean             2.444e+02  9.151e+04   0.003    0.998
## area_mean                3.928e+01  1.509e+04   0.003    0.998
## compactness_mean        -8.448e+04  9.715e+07  -0.001    0.999
## concavity_mean           6.581e+04  7.593e+07   0.001    0.999
## concave_points_mean      6.495e+04  1.351e+08   0.000    1.000
## perimeter_se             5.870e+02  2.150e+05   0.003    0.998
## smoothness_se           -5.410e+05  1.696e+08  -0.003    0.997
## compactness_se           1.112e+05  3.998e+07   0.003    0.998
## concavity_se            -7.957e+04  8.142e+07  -0.001    0.999
## concave_points_se        2.492e+05  1.063e+08   0.002    0.998
## fractal_dimension_se    -1.070e+06  5.642e+08  -0.002    0.998
## radius_worst             1.678e+03  1.506e+06   0.001    0.999
## perimeter_worst         -1.327e+02  1.623e+05  -0.001    0.999
## smoothness_worst         4.702e+04  2.044e+07   0.002    0.998
## compactness_worst       -2.443e+03  2.584e+07   0.000    1.000
## symmetry_worst           5.755e+03  1.162e+07   0.000    1.000
## fractal_dimension_worst  9.497e+04  4.798e+07   0.002    0.998
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.3470e+02  on 482  degrees of freedom
## Residual deviance: 5.3201e-09  on 464  degrees of freedom
## AIC: 38
## 
## Number of Fisher Scoring iterations: 40

We can see from the deviance residuals and the near-zero residual deviance that the model with all variables fits the training data almost perfectly, and is likely overfitting the data.

Deviance Residual Plots for all Variables Modeled Independently Against Diagnosis

The deviance residual is useful for determining whether individual points are not well fit by the model. The deviance residual for the ith observation is the signed square root of that case's contribution to the overall model deviance, DEV.

In standard linear models, we estimate the parameters by minimizing the sum of squared residuals, which is equivalent to finding the parameters that maximize the likelihood. In a GLM we also fit parameters by maximizing the likelihood, which is equivalent to finding the parameter values that minimize the deviance.
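
A deviance residual plot for any one of these single-predictor models can be drawn directly from the fitted glm object. A sketch using radius_mean as an illustrative example (the same pattern applies to each of the 30 variables modeled independently against diagnosis):

# Fit one single-predictor logistic model and extract its deviance residuals.
m_radius <- glm(diagnosis ~ radius_mean, data = bc_data_no_id, family = binomial)
dev_res  <- residuals(m_radius, type = "deviance")

# Plot the deviance residuals against the fitted probabilities to spot
# observations the model fits poorly.
plot(fitted(m_radius), dev_res,
     xlab = "Fitted probability", ylab = "Deviance residual",
     main = "diagnosis ~ radius_mean")
abline(h = 0, lty = 2)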

Methods for Selecting Variables

In order to examine all possible models for these variables, we would have to fit 2^30 (over one billion) different model combinations, which is computationally impractical. Instead we must choose a method for model selection that takes into account the high correlation between variables.

Examine the Correlation of the Variables

Often we have variables that are highly correlated and therefore redundant. By eliminating highly correlated features we can avoid giving undue predictive weight to the information those features share.

These are the variables that are highly correlated:
##  [1] "concavity_mean"          "concave_points_mean"    
##  [3] "compactness_mean"        "concave_points_worst"   
##  [5] "concavity_worst"         "perimeter_worst"        
##  [7] "radius_worst"            "compactness_worst"      
##  [9] "perimeter_mean"          "area_worst"             
## [11] "radius_mean"             "perimeter_se"           
## [13] "radius_se"               "compactness_se"         
## [15] "concave_points_se"       "area_se"                
## [17] "concavity_se"            "symmetry_mean"          
## [19] "smoothness_mean"         "fractal_dimension_worst"
## [21] "fractal_dimension_mean"  "texture_worst"

Correlations between all features are calculated and visualized with the corrplot package. We could consider removing all features with a correlation higher than 0.7 as a means of variable selection, but for now let's explore other options for variable selection.
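
A sketch of this step, assuming the data frame bc_data_no_id; findCorrelation() from the caret package flags the features that would be dropped at a chosen cutoff:

library(corrplot)
library(caret)

# Correlation matrix of the 30 numeric features (diagnosis column excluded).
feature_cols <- setdiff(names(bc_data_no_id), "diagnosis")
feature_cors <- cor(bc_data_no_id[, feature_cols])

# Visualize the correlations, clustering similar features together.
corrplot(feature_cors, order = "hclust", tl.cex = 0.6)

# Candidate features to remove at a |r| > 0.7 cutoff.
findCorrelation(feature_cors, cutoff = 0.7, names = TRUE)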

AIC - Akaike Information Criterion

The Akaike information criterion (AIC) is a measure based on in-sample fit that estimates how well a model will predict future values. Among a set of candidate models, the one with the minimum AIC is preferred. The Bayesian information criterion (BIC) is another model selection criterion that measures the trade-off between model fit and model complexity. A lower AIC or BIC value indicates a better model. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models, and thus provides a means for model selection.

AIC basic principles:

  • A lower AIC indicates a more parsimonious model, relative to a model fit with a higher AIC.

  • AIC is a relative measure of model parsimony, so it only has meaning when we compare the AIC of alternative hypotheses (i.e., different models of the data).

  • The comparisons are only valid for models fit to the same response data (i.e., the same values of y).

  • You shouldn't compare too many models with AIC. You will run into the same problems as with multiple comparisons using p-values: by chance you might find a model with the lowest AIC that isn't truly the most appropriate model.

  • When using AIC you might end up with multiple models that perform similarly to each other, and therefore similar evidence weights for different alternative hypotheses.

Let’s take a look at AIC for each variable modeled independently against diagnosis.

Table of AIC for All Variables Modeled Independently against Diagnosis

             Variable    |      AIC
------------------------ | ----------------------
             radius_mean | 289.8826628
            texture_mean | 545.7042499   
          perimeter_mean | 269.1114237
               area_mean | 286.1082451
         smoothness_mean | 573.9726476
        compactness_mean | 439.0720172
          concavity_mean | 340.3371892    
     concave_points_mean | 233.9324274
           symmetry_mean | 581.952089
  fractal_dimension_mean | 638.7013139
               radius_se | 405.8715899
              texture_se | 638.3764904
            perimeter_se | 393.5925864
                 area_se | 307.2621145
           smoothness_se | 634.3387307
          compactness_se | 596.7520482 
            concavity_se | 604.7739951       
       concave_points_se | 556.1706583
             symmetry_se | 638.6386248 
    fractal_dimension_se | 635.6527324
            radius_worst | 205.7881332
           texture_worst | 525.6797842 
         perimeter_worst | 192.8437216 
              area_worst | 207.355543
        smoothness_worst | 550.1211202 
       compactness_worst | 434.1176233
         concavity_worst | 380.1374232      
    concave_points_worst | 224.3003487
          symmetry_worst | 546.0354419 
 fractal_dimension_worst | 584.9112054
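
The table above can be generated by fitting one single-predictor logistic model per feature and collecting the AIC of each fit. A minimal sketch, assuming the data frame bc_data_no_id:

# Fit diagnosis ~ <feature> for each of the 30 features and record the AIC.
predictors <- setdiff(names(bc_data_no_id), "diagnosis")

aic_table <- data.frame(
  Variable = predictors,
  AIC = sapply(predictors, function(v) {
    f <- reformulate(v, response = "diagnosis")
    AIC(glm(f, data = bc_data_no_id, family = binomial))
  })
)

# Sort so the variables with the lowest (best) AIC appear first.
aic_table[order(aic_table$AIC), ]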
                      

Genetic Algorithm

A genetic algorithm is a search heuristic that is inspired by Charles Darwin’s theory of natural evolution. The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions.

Using a genetic algorithm to select the variables which best predict a benign or malignant outcome:

Using a genetic algorithm, multiple candidate models were generated, and the variables that appeared most often across those models were identified. I chose the 15 variables that were repeatedly selected by the algorithm to create the final model below.
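
The search itself was run with the glmulti package (as the output below indicates: method "g", fitted with glm, scored by AIC, main effects only). A sketch of such a call; the object names bc_train and diagnosis_b (the 0/1 recoded response) and the confidence-set size are assumptions for illustration:

library(glmulti)

# Genetic-algorithm search over main-effects logistic models, scored by AIC.
predictors <- setdiff(names(bc_train), "diagnosis_b")
ga_models <- glmulti(y = "diagnosis_b", xr = predictors, data = bc_train,
                     level = 1,          # main effects only, no interactions
                     method = "g",       # genetic algorithm search
                     crit = "aic",
                     fitfunction = "glm",
                     family = binomial,  # passed through to glm()
                     confsetsize = 20, plotty = FALSE, report = FALSE)

# Best model found by the search.
summary(ga_models@objects[[1]])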

Summary and Plots of Final Model Selected

## 
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.60393  -0.00067  -0.00001   0.00000   2.84477  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             -3.236e+01  1.545e+01  -2.094  0.03622 * 
## perimeter_mean          -9.403e-01  4.077e-01  -2.306  0.02108 * 
## compactness_mean        -1.172e+02  5.283e+01  -2.218  0.02654 * 
## concavity_mean           2.249e+02  7.597e+01   2.960  0.00308 **
## symmetry_mean           -6.743e+01  4.968e+01  -1.357  0.17469   
## area_se                  2.623e-01  1.111e-01   2.361  0.01825 * 
## concave_points_se        1.366e+03  5.284e+02   2.586  0.00972 **
## concavity_se            -2.549e+02  9.403e+01  -2.711  0.00671 **
## fractal_dimension_se    -3.238e+03  1.245e+03  -2.601  0.00930 **
## texture_worst            7.012e-01  2.166e-01   3.237  0.00121 **
## area_worst               6.923e-02  2.838e-02   2.440  0.01470 * 
## symmetry_worst           4.811e+01  2.323e+01   2.072  0.03831 * 
## fractal_dimension_worst  3.782e+02  1.443e+02   2.621  0.00878 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 634.705  on 482  degrees of freedom
## Residual deviance:  29.868  on 470  degrees of freedom
## AIC: 55.868
## 
## Number of Fisher Scoring iterations: 13

## glmulti.analysis
## Method: g / Fitting: glm / IC used: aic
## Level: 1 / Marginality: FALSE
## From 20 models:
## Best IC: 55.8678036561631
## Best model:
## [1] "diagnosis_b ~ 1 + perimeter_mean + compactness_mean + concavity_mean + "  
## [2] "    symmetry_mean + area_se + concave_points_se + concavity_se + "        
## [3] "    fractal_dimension_se + texture_worst + area_worst + symmetry_worst + "
## [4] "    fractal_dimension_worst"                                              
## Evidence weight: 0.242344018503499
## Worst IC: 63.7382958676012
## 5 models within 2 IC units.
## 12 models to reach 95% of evidence weight.
## Convergence after 460 generations.
## Time elapsed: 1.14409490029017 minutes.

Linearity Assumption Check

We can see that the variables are all linearly related to logit(pi) when all other predictors are held constant, thus meeting the first condition for logistic regression.
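
A check of this kind can be made by plotting each predictor in the final model against the logit of the fitted probabilities (the linear predictor) and looking for a roughly linear trend. A sketch, where final_model and bc_train are assumed names for the fitted model and the model-building data:

# The linear predictor from a binomial glm is already on the logit scale.
logit_p <- predict(final_model)   # default type = "link"

model_vars <- c("perimeter_mean", "compactness_mean", "concavity_mean",
                "symmetry_mean", "area_se", "concave_points_se", "concavity_se",
                "fractal_dimension_se", "texture_worst", "area_worst",
                "symmetry_worst", "fractal_dimension_worst")

# One scatter plot per predictor, with a lowess smoother to judge linearity.
op <- par(mfrow = c(3, 4), mar = c(4, 4, 1, 1))
for (v in model_vars) {
  plot(bc_train[[v]], logit_p, xlab = v, ylab = "logit(p)")
  lines(lowess(bc_train[[v]], logit_p), col = "red")
}
par(op)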

Model Performance on the Hold-Out Data

## Confusion Matrix and Statistics
## 
##    
##      0  1
##   0 49  1
##   1  2 34
##                                           
##                Accuracy : 0.9651          
##                  95% CI : (0.9014, 0.9927)
##     No Information Rate : 0.593           
##     P-Value [Acc > NIR] : 1.063e-15       
##                                           
##                   Kappa : 0.9281          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9608          
##             Specificity : 0.9714          
##          Pos Pred Value : 0.9800          
##          Neg Pred Value : 0.9444          
##              Prevalence : 0.5930          
##          Detection Rate : 0.5698          
##    Detection Prevalence : 0.5814          
##       Balanced Accuracy : 0.9661          
##                                           
##        'Positive' Class : 0               
## 

The final step of model validation is to make predictions on a sample of data not used in model building; this is the hold-out, or test, set. The model is used to predict the probability that each hold-out observation is malignant, and a threshold is chosen: probabilities above 0.5 are classified as malignant, and benign otherwise. These predicted classifications are compared against the actual values using a confusion matrix and several measures of model performance, such as sensitivity (the proportion of actual positives that are correctly identified) and specificity (the proportion of actual negatives that are correctly identified).
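
The confusion matrix above can be produced with the caret package. A sketch, where final_model, bc_test and the 0/1 column diagnosis_b are assumed names:

library(caret)

# Predicted probability of malignancy for each hold-out observation.
test_probs <- predict(final_model, newdata = bc_test, type = "response")

# Classify at the 0.5 threshold (1 = malignant, 0 = benign).
test_class <- factor(ifelse(test_probs > 0.5, 1, 0), levels = c(0, 1))

# Compare predictions against the actual hold-out diagnoses.
confusionMatrix(test_class, factor(bc_test$diagnosis_b, levels = c(0, 1)))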

Hypothesis Test

For our final model validation we conclude the hypothesis test by comparing the full model to the null (intercept-only) model with an analysis of deviance and examining the resulting p-value.
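
This comparison is a likelihood-ratio (chi-squared) test on the difference in deviance between the two nested models. A sketch, assuming the fitted objects are named null_model and final_model:

# Intercept-only model on the same model-building data as the final model.
null_model <- glm(diagnosis_b ~ 1, data = bc_train, family = binomial)

# Analysis of deviance: the drop in deviance is compared to a chi-squared
# distribution with df equal to the number of added parameters.
anova(null_model, final_model, test = "Chisq")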

## Analysis of Deviance Table
## 
## Model 1: diagnosis_b ~ 1
## Model 2: diagnosis_b ~ 1 + perimeter_mean + compactness_mean + concavity_mean + 
##     symmetry_mean + area_se + concave_points_se + concavity_se + 
##     fractal_dimension_se + texture_worst + area_worst + symmetry_worst + 
##     fractal_dimension_worst
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       482     634.70                          
## 2       470      29.87 12   604.84 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the p-value is essentially zero, and we therefore reject the null hypothesis.

Brief description of methodology

  • Identified and collected a dataset to perform analysis
  • Split the data into model build and hold-out (i.e., test) sets.
  • Retain the test set as a final check on model performance.
  • Exploratory analysis
  • Identify independent (predictor) variable correlations
  • Perform model selection using univariate predictor selection
  • Perform model selection using stepwise predictor selection
  • Perform model selection using genetic algorithms, a heuristic approach to model selection
  • Compare multiple model selection approaches, synthesizing a single model from all prior approaches
  • Validate model compared to null model (hypothesis)
  • Check final model performance on hold-out data set

Part 5 - Conclusion

In rejecting the null hypothesis, we found a combination of variables that predicted the diagnosis with 96.5% accuracy on the hold-out data. We also found that many of the variables are highly correlated, making variable selection important. A lack of domain knowledge made feature selection more difficult as well, and I learned that modeling with 30 variables is more challenging than expected.

References

This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/

Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Creators:

  1. Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792 wolberg ‘@’ eagle.surgery.wisc.edu

  2. W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street ‘@’ cs.wisc.edu 608-262-6619

  3. Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi ‘@’ cs.wisc.edu

W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.