Breast cancer is one of the most dreaded and deadly cancer diagnoses a woman can receive. For women in the U.S., breast cancer death rates are higher than those for any other cancer besides lung cancer. Many institutions have dedicated years of research to improving the survival chances of breast cancer patients, and incidence rates have shown a measure of improvement since 2000. Treatment advances, earlier detection through screening, and increased awareness are all key factors in surviving breast cancer, and the emergence of machine learning in medical research is an important step in detecting and predicting malignant tumors.
Each year an estimated 252,710 US women are diagnosed with breast cancer, and about 1 in 8 US women will develop invasive breast cancer over the course of her lifetime. The most advanced form, Stage 4 breast cancer, is also called metastatic breast cancer. Metastasis happens when cancer cells migrate from the breast to other parts of the body and trigger cancerous growth there; it is considered terminal, as there is no cure. More than 40,000 US women a year die from metastatic breast cancer, a number that has not changed substantially since 1970. Research enabling earlier detection of malignancy is imperative to the survival of women diagnosed with breast cancer.
Tests such as MRI, mammogram, ultrasound and biopsy are commonly used to diagnose breast cancer. Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital at Madison, created a dataset using fine needle aspiration biopsies to collect samples from patients with solid breast masses and a computer vision approach known as “snakes” to compute values for ten characteristics of each cell nucleus, measuring size, shape and texture. The mean, standard error and extreme (worst) values of these features are computed, resulting in a total of 30 nuclear features for each sample.
Using this dataset we can examine the observations from the biopsies and investigate whether any variable, or combination of variables, is a good predictor of a malignant or benign diagnosis.
The features in this dataset characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses. They describe characteristics of the cell nuclei present in each image. Dr. William H. Wolberg collected samples from 569 patients with solid breast masses and computed values for ten characteristics of each cell nucleus, measuring size, shape and texture, along with the mean, standard error and extreme values of each of these features.
Each case represents an individual sample, or observation, of tissue taken from a biopsy of a breast mass. There are 569 observations in the given dataset.
The response variable is the diagnosis, a qualitative binary categorical variable with levels benign and malignant.
There are 30 independent variables, all quantitative. They describe aspects of the tissue samples and comprise the mean, standard error and worst (largest) value for each of the ten nuclear features.
Ten real-valued features are computed for each cell nucleus:

- radius
- texture
- perimeter
- area
- smoothness
- compactness
- concavity
- concave points
- symmetry
- fractal dimension
All feature values are recorded with four significant digits. There are no missing data.
The class distribution of the data is 357 benign and 212 malignant observations.
This study is an observational study of biopsied breast mass tissue. The samples were taken as a result of mass detection and not as part of a designed experiment. They were collected during a medical procedure conducted to examine the breast mass tissue and diagnose the mass as benign or malignant. The samples are independent of each other.
The population of interest is people who have a detected breast mass and undergo a diagnostic procedure. The study is a cross-sectional study (also known as a cross-sectional analysis, transverse study, or prevalence study), a type of observational study that analyzes data from a population, or a representative subset, at a specific point in time.
A cross-sectional study should be representative of the population if generalizations from the findings are to have any validity. The sample size should be sufficiently large to estimate the prevalence of the conditions of interest with adequate precision.
Non-response, or lack of voluntary subject participation, is a particular problem affecting cross-sectional studies and can bias the outcome measures, especially when the characteristics of non-responders differ from those of responders.
No, these data cannot be used to establish causal links between the variables of interest because of the type of study, but the findings can be used to describe the prevalence of the disease within the population.
We perform relevant descriptive statistics, including summary statistics and visualization of the data, and consider what the exploratory data analysis suggests about the research question.
There are no missing values. For several variables there are clear differences between the distributions of the malignant and benign classes. There are no obvious outliers.
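A minimal sketch of these exploratory checks, assuming the data frame is named `bc_data_no_id` with a `diagnosis` column, as in the model call later in the report:

```r
# Exploratory checks (sketch): missing values, class balance, and the
# distribution of one feature by diagnosis. Object names are assumptions
# taken from the glm call shown later in the report.
sum(is.na(bc_data_no_id))        # expect 0: no missing values
table(bc_data_no_id$diagnosis)   # expect 357 benign and 212 malignant

# Side-by-side boxplots of a single feature by diagnosis; repeat for others
boxplot(radius_mean ~ diagnosis, data = bc_data_no_id,
        main = "radius_mean by diagnosis", ylab = "radius_mean")
```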
Logistic regression is used for modeling when there is a categorical response variable with two levels, in other words when the dependent variable is binary. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. Logistic regression is a type of generalized linear model.
Conditions for logistic regression:

- Each outcome of the response variable is independent of the other outcomes.
- Each predictor is linearly related to logit(p_i) when all other predictors are held constant.
Our null hypothesis is that none of the variables are good predictors for the diagnosis. The alternative hypothesis is that there is a specific variable or combination of variables that is a good predictor for the diagnosis. We can write our hypotheses as:
H0: No variables, individually or in combination, are good predictors of a benign or malignant diagnosis.
H1: There is a specific variable or combination of variables that is a good predictor of a benign or malignant diagnosis.
Generalized linear models (GLMs) are an extension of linear models that accommodate non-normal response variables. Logistic regression is the GLM for binary response variables, where there are two possible outcomes.
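Concretely, logistic regression relates the probability p_i that observation i is malignant to the predictors through the logit link (the standard GLM formulation):

$$
\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i}
$$

In R this model is fit with glm() and family = binomial, which is the call shown in the summary below.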
##
## Call:
## glm(formula = diagnosis ~ radius_mean + texture_mean + area_mean +
## compactness_mean + concavity_mean + concave_points_mean +
## perimeter_se + smoothness_se + compactness_se + concavity_se +
## concave_points_se + fractal_dimension_se + radius_worst +
## perimeter_worst + smoothness_worst + compactness_worst +
## symmetry_worst + fractal_dimension_worst, family = binomial,
## data = bc_data_no_id, control = list(maxit = 100))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.888e-05 -2.110e-08 -2.110e-08 2.110e-08 2.927e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.628e+03 1.018e+07 0.000 1.000
## radius_mean -4.178e+03 1.836e+06 -0.002 0.998
## texture_mean 2.444e+02 9.151e+04 0.003 0.998
## area_mean 3.928e+01 1.509e+04 0.003 0.998
## compactness_mean -8.448e+04 9.715e+07 -0.001 0.999
## concavity_mean 6.581e+04 7.593e+07 0.001 0.999
## concave_points_mean 6.495e+04 1.351e+08 0.000 1.000
## perimeter_se 5.870e+02 2.150e+05 0.003 0.998
## smoothness_se -5.410e+05 1.696e+08 -0.003 0.997
## compactness_se 1.112e+05 3.998e+07 0.003 0.998
## concavity_se -7.957e+04 8.142e+07 -0.001 0.999
## concave_points_se 2.492e+05 1.063e+08 0.002 0.998
## fractal_dimension_se -1.070e+06 5.642e+08 -0.002 0.998
## radius_worst 1.678e+03 1.506e+06 0.001 0.999
## perimeter_worst -1.327e+02 1.623e+05 -0.001 0.999
## smoothness_worst 4.702e+04 2.044e+07 0.002 0.998
## compactness_worst -2.443e+03 2.584e+07 0.000 1.000
## symmetry_worst 5.755e+03 1.162e+07 0.000 1.000
## fractal_dimension_worst 9.497e+04 4.798e+07 0.002 0.998
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.3470e+02 on 482 degrees of freedom
## Residual deviance: 5.3201e-09 on 464 degrees of freedom
## AIC: 38
##
## Number of Fisher Scoring iterations: 40
We can see from the deviance residuals that the model with all of these variables is too good a fit: the residual deviance is essentially zero and the coefficient standard errors are enormous, which suggests the model is overfitting (effectively separating) the training data.
The deviance residual is useful for determining whether individual points are poorly fit by the model. The deviance residual for the ith observation is the signed square root of that observation's contribution to the model deviance, DEV.
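For a binary response with fitted probability \(\hat{p}_i\), the deviance residual takes the standard form:

$$
d_i = \operatorname{sign}(y_i - \hat{p}_i)\,\sqrt{-2\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]},
\qquad \mathrm{DEV} = \sum_i d_i^2 .
$$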
In standard linear models we estimate the parameters by minimizing the sum of squared residuals, which is equivalent to finding the parameters that maximize the likelihood. In a GLM we also fit the parameters by maximizing the likelihood, which is equivalent to finding the parameter values that minimize the deviance.
In order to examine all possible models for these variables, we would have to fit 2^30 (over one billion) different model combinations, which is computationally infeasible. Instead we must choose a method for model selection that takes into account the high correlation between the variables.
Often we have variables that are highly correlated and therefore redundant. By eliminating highly correlated features we can avoid a predictive bias toward the information contained in those features.
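A minimal sketch of such a correlation screen, assuming the data frame `bc_data_no_id` and using caret's `findCorrelation` with an illustrative cutoff of 0.9 (the cutoff is a choice for this sketch, not one taken from the analysis):

```r
# Correlation screen (sketch): flag features whose pairwise correlation
# exceeds 0.9 as candidates for removal. The 0.9 cutoff is illustrative.
library(caret)

feature_cols <- setdiff(names(bc_data_no_id), "diagnosis")
cor_mat      <- cor(bc_data_no_id[feature_cols])
drop_idx     <- findCorrelation(cor_mat, cutoff = 0.9)
feature_cols[drop_idx]   # highly correlated features to consider dropping
```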
The Akaike information criterion (AIC) is a technique based on in-sample fit for estimating how well a model will predict future values; a good model is the one with the minimum AIC among the candidate models. The Bayesian information criterion (BIC) is another model-selection criterion that measures the trade-off between model fit and model complexity. A lower AIC or BIC value indicates a better fit. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models, and thus provides a means for model selection.
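For reference, with k estimated parameters, n observations, and maximized likelihood \(\hat{L}\), the two criteria are defined as:

$$
\mathrm{AIC} = 2k - 2\log\hat{L},
\qquad
\mathrm{BIC} = k\log(n) - 2\log\hat{L} .
$$

Both penalize model complexity through k; BIC penalizes it more heavily as n grows.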
AIC basic principles:
- A lower AIC indicates a more parsimonious model, relative to a model fit with a higher AIC.
- AIC is a relative measure of model parsimony, so it only has meaning when we compare the AIC of alternative models (i.e., different models of the same data).
- The comparisons are only valid for models fit to the same response data (i.e., the same values of y).
- You shouldn't compare too many models with AIC; as with multiple comparisons of p-values, you might by chance find a model with the lowest AIC that isn't truly the most appropriate model.
- When using AIC you might end up with multiple models that perform similarly to each other, giving similar evidence weights for different alternative hypotheses.
Let’s take a look at AIC for each variable modeled independently against diagnosis.
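The table below can be produced with a simple loop over the features; a minimal sketch, assuming the data frame `bc_data_no_id` with the `diagnosis` column as the response:

```r
# Single-variable AIC screen (sketch): fit one logistic regression per feature
# and record its AIC. Data frame and column names follow the glm call above.
feature_cols <- setdiff(names(bc_data_no_id), "diagnosis")

single_var_aic <- sapply(feature_cols, function(v) {
  fit <- glm(reformulate(v, response = "diagnosis"),
             family = binomial, data = bc_data_no_id)
  AIC(fit)
})

sort(single_var_aic)   # lower AIC indicates a better single-variable fit
```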
Variable | AIC
------------------------ | ----------------------
radius_mean | 289.8826628
texture_mean | 545.7042499
perimeter_mean | 269.1114237
area_mean | 286.1082451
smoothness_mean | 573.9726476
compactness_mean | 439.0720172
concavity_mean | 340.3371892
concave_points_mean | 233.9324274
symmetry_mean | 581.952089
fractal_dimension_mean | 638.7013139
radius_se | 405.8715899
texture_se | 638.3764904
perimeter_se | 393.5925864
area_se | 307.2621145
smoothness_se | 634.3387307
compactness_se | 596.7520482
concavity_se | 604.7739951
concave_points_se | 556.1706583
symmetry_se | 638.6386248
fractal_dimension_se | 635.6527324
radius_worst | 205.7881332
texture_worst | 525.6797842
perimeter_worst | 192.8437216
area_worst | 207.355543
smoothness_worst | 550.1211202
compactness_worst | 434.1176233
concavity_worst | 380.1374232
concave_points_worst | 224.3003487
symmetry_worst | 546.0354419
fractal_dimension_worst | 584.9112054
A genetic algorithm is a search heuristic that is inspired by Charles Darwin’s theory of natural evolution. The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions.
Using a genetic algorithm to select the variables which best predict a benign or malignant outcome:
Using a genetic algorithm, multiple candidate models were generated, and the variables that appeared most often across those models were identified. I chose the 15 variables that were repeatedly selected by the algorithm to create the final model below.
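A sketch of this search using the glmulti package; the settings (main effects only, genetic method, AIC, a confidence set of 20 models) mirror the summary printed below, while the object and column names (`ga_search`, `bc_train`, `diagnosis_b`) are assumptions:

```r
# Genetic-algorithm model search (sketch) with glmulti. Settings mirror the
# summary below (level = 1, method = "g", crit = "aic", 20 models retained);
# object and column names are assumptions.
library(glmulti)

feature_cols <- setdiff(names(bc_train), "diagnosis_b")

ga_search <- glmulti(
  y           = "diagnosis_b",   # binary response column (assumed name)
  xr          = feature_cols,    # candidate predictors
  data        = bc_train,
  level       = 1,               # main effects only, no interactions
  method      = "g",             # genetic algorithm search
  crit        = "aic",           # information criterion to minimize
  confsetsize = 20,              # keep the 20 best models, as in the summary below
  family      = binomial         # passed through to glm()
)

print(ga_search)                  # search summary (best IC, evidence weights, ...)
summary(ga_search@objects[[1]])   # coefficient table of the best stored model
```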
##
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.60393 -0.00067 -0.00001 0.00000 2.84477
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.236e+01 1.545e+01 -2.094 0.03622 *
## perimeter_mean -9.403e-01 4.077e-01 -2.306 0.02108 *
## compactness_mean -1.172e+02 5.283e+01 -2.218 0.02654 *
## concavity_mean 2.249e+02 7.597e+01 2.960 0.00308 **
## symmetry_mean -6.743e+01 4.968e+01 -1.357 0.17469
## area_se 2.623e-01 1.111e-01 2.361 0.01825 *
## concave_points_se 1.366e+03 5.284e+02 2.586 0.00972 **
## concavity_se -2.549e+02 9.403e+01 -2.711 0.00671 **
## fractal_dimension_se -3.238e+03 1.245e+03 -2.601 0.00930 **
## texture_worst 7.012e-01 2.166e-01 3.237 0.00121 **
## area_worst 6.923e-02 2.838e-02 2.440 0.01470 *
## symmetry_worst 4.811e+01 2.323e+01 2.072 0.03831 *
## fractal_dimension_worst 3.782e+02 1.443e+02 2.621 0.00878 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 634.705 on 482 degrees of freedom
## Residual deviance: 29.868 on 470 degrees of freedom
## AIC: 55.868
##
## Number of Fisher Scoring iterations: 13
## glmulti.analysis
## Method: g / Fitting: glm / IC used: aic
## Level: 1 / Marginality: FALSE
## From 20 models:
## Best IC: 55.8678036561631
## Best model:
## [1] "diagnosis_b ~ 1 + perimeter_mean + compactness_mean + concavity_mean + "
## [2] " symmetry_mean + area_se + concave_points_se + concavity_se + "
## [3] " fractal_dimension_se + texture_worst + area_worst + symmetry_worst + "
## [4] " fractal_dimension_worst"
## Evidence weight: 0.242344018503499
## Worst IC: 63.7382958676012
## 5 models within 2 IC units.
## 12 models to reach 95% of evidence weight.
## Convergence after 460 generations.
## Time elapsed: 1.14409490029017 minutes.
Examining each predictor against logit(p_i) while the other predictors are held constant shows an approximately linear relationship, meeting the linearity condition for logistic regression; a sketch of this check follows.
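One way to check this condition empirically is to plot each predictor in the final model against the logit of its fitted probability; a minimal sketch, with `final_model` and `bc_train` as assumed object names and the predictor list taken from the coefficient table above:

```r
# Linearity-in-the-logit check (sketch): plot each predictor against the logit
# of the fitted probabilities. Object names are assumptions.
probs  <- predict(final_model, type = "response")
logits <- qlogis(probs)   # logit (log-odds) of the fitted probabilities

predictors <- c("perimeter_mean", "compactness_mean", "concavity_mean",
                "symmetry_mean", "area_se", "concave_points_se", "concavity_se",
                "fractal_dimension_se", "texture_worst", "area_worst",
                "symmetry_worst", "fractal_dimension_worst")

op <- par(mfrow = c(3, 4))
for (v in predictors) {
  plot(bc_train[[v]], logits, pch = 20, xlab = v, ylab = "logit(p-hat)")
  lines(lowess(bc_train[[v]], logits), col = "red")   # smoothed trend
}
par(op)
```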
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 49 1
## 1 2 34
##
## Accuracy : 0.9651
## 95% CI : (0.9014, 0.9927)
## No Information Rate : 0.593
## P-Value [Acc > NIR] : 1.063e-15
##
## Kappa : 0.9281
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9608
## Specificity : 0.9714
## Pos Pred Value : 0.9800
## Neg Pred Value : 0.9444
## Prevalence : 0.5930
## Detection Rate : 0.5698
## Detection Prevalence : 0.5814
## Balanced Accuracy : 0.9661
##
## 'Positive' Class : 0
##
The final test of model validation is done by making predictions on a sample of data not used to build the model: the hold-out, or test, set. The model is used to predict the probability that each held-out observation is malignant. A threshold of 0.5 is chosen: observations with predicted probabilities above 0.5 are classified as malignant and the rest as benign. These predicted classifications are compared against the actual values using a confusion matrix and several measures of model performance, such as sensitivity (the proportion of actual positives that are correctly identified) and specificity (the proportion of actual negatives that are correctly identified).
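A minimal sketch of this hold-out evaluation, producing a confusion matrix like the one above with caret; the object names (`final_model`, `bc_test`, `diagnosis_b`) are assumptions:

```r
# Hold-out validation (sketch): predict malignancy probabilities on the test
# set, threshold at 0.5, and compare to the actual labels. Names are assumptions.
library(caret)

pred_prob  <- predict(final_model, newdata = bc_test, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
actual     <- factor(bc_test$diagnosis_b, levels = c(0, 1))

confusionMatrix(pred_class, actual)   # accuracy, sensitivity, specificity, kappa, ...
```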
For our final model validation we conclude the hypothesis test with an analysis-of-deviance (likelihood-ratio) test, computing the p-value for the comparison of the full model against the null model.
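A sketch of this comparison, with `null_model` and `final_model` as assumed object names:

```r
# Likelihood-ratio test (sketch): compare the intercept-only model to the
# selected model. Object names are assumptions.
null_model <- glm(diagnosis_b ~ 1, family = binomial, data = bc_train)
anova(null_model, final_model, test = "Chisq")
```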
## Analysis of Deviance Table
##
## Model 1: diagnosis_b ~ 1
## Model 2: diagnosis_b ~ 1 + perimeter_mean + compactness_mean + concavity_mean +
## symmetry_mean + area_se + concave_points_se + concavity_se +
## fractal_dimension_se + texture_worst + area_worst + symmetry_worst +
## fractal_dimension_worst
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 482 634.70
## 2 470 29.87 12 604.84 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that the p-value is nearly zero, and we therefore reject the null hypothesis.
In rejecting the null hypothesis, we found a combination of variables that predicts the diagnosis with 96.5% accuracy on the hold-out set. We also found that many of the variables are highly correlated, making variable selection important. A lack of domain knowledge made feature selection more difficult as well. I also learned that modeling with 30 variables is more challenging than expected.
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Creators:
Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792 wolberg ‘@’ eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street ‘@’ cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi ‘@’ cs.wisc.edu
W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.