Analysis of University of Wisconsin Data

January 25, 2017

Prepared by Marcel Merchat

Overview

We explore ten averaged tumor variables and a corresponding set of worst dimensions for the same variables. Our analysis quantifies how the worst dimensions of malignant tumors are greater than the worst dimensions of healthier benign ones. Next we develop a quantitative formula that predicts whether tumors are malignant or benign using a training set selected from the raw data. The remaining data is used as a validation set to evaluate how well the formula works.

We adjust the formula to balance the probability of detecting malignant tumors against the rate of false alarms. The relationship between these probabilities is illustrated by the characteristic curve in Figure-XXX. In statistical detection theory, such a plot is called a receiver operating characteristic (ROC) curve. For example, when a radar receiver system searches for enemy warplanes, the probability of detecting an enemy plane is plotted against the probability of a false alarm, which could mean shooting down a friendly plane. This illustrates that missed detections and false alarms carry different costs and should be considered separately rather than merged into a single overall accuracy figure.

We build and test our prediction formula using the subset of observations that have 8-digit serial numbers, defining this subset as Group-8. Plot-1 and Plot-2 below give some insight into why only 8-digit serial numbers are used, but we do not discuss this division of the raw data in depth; further study is needed for the other groups. Of the total of 569 observations, 70 belong to Group-8, with 49 of these randomly selected for the training group and 21 reserved for validation. While the raw data as a whole consists of 357 benign and 212 malignant cases (about 37% malignant), Group-8 consists of 37 benign and 33 malignant cases (about 47% malignant). This higher rate of malignant cases in Group-8 could also require future consideration.

Raw Data

The data file is available from the University of Wisconsin server at http://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WDBC/WDBC.dat. Since this source lacks column names, the column names were taken from an updated version that can only be downloaded manually at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. There are 32 columns of data: identification, diagnosis, and three groups of ten variables. There are 569 rows of observation records, consisting of 357 benign (B) and 212 malignant (M) cases. The mean values of the ten variables below are reported in Columns 3-12.

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter² / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension (“coastline approximation” - 1)

The corresponding standard errors for the ten variables are reported in Columns 13-22 and the corresponding worst values in Columns 23-32. For example, since Field-3 is the mean radius, Field-13 is the standard error (se) of the radius and Field-23 is the worst radius.
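The column layout described above can be written down as a small lookup. This is a minimal Python sketch (the report's own analysis is done in R); the feature names and the `_mean`, `_se`, `_worst` suffixes are assumptions modelled on the Kaggle header, and the helper names are hypothetical.

```python
# Ten base features in the order listed above. The exact spellings are
# assumptions; only the ordering comes from the report.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
SUFFIXES = ["mean", "se", "worst"]  # Columns 3-12, 13-22, 23-32


def column_number(feature, suffix):
    """Return the 1-based column number of a feature/suffix pair."""
    return 3 + 10 * SUFFIXES.index(suffix) + FEATURES.index(feature)


def column_names():
    """Build all 32 column names: id, diagnosis, then 3 x 10 features."""
    names = ["id", "diagnosis"]
    for suffix in SUFFIXES:
        names.extend(f"{feature}_{suffix}" for feature in FEATURES)
    return names


print(column_number("radius", "mean"))   # 3
print(column_number("radius", "se"))     # 13
print(column_number("radius", "worst"))  # 23
print(len(column_names()))               # 32
```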

Exploration

To simplify the model and make its predictions more accurate, we eliminate correlated variables, as analysis of variance (ANOVA) suggests. For example, the radius, perimeter, and area are highly correlated, so only one of these is used for any given comparison. Similarly, concavity and concave points are correlated. Finally, each variable was separately tested for its prediction accuracy. The worst perimeter, worst concavity, and worst concave points columns were selected for exploration and plotted against the mean compactness and mean perimeter.

Different Types of Serial Numbers

There are six different types of serial numbers in the training dataset: some serial numbers have 4 digits, some have 5 digits, and so on up to 9 digits. We build our prediction formula using only the 8-digit type, which we designate as Group-8.

Table-1: Serial Number Groups
Serial_Number_Length (digits)	Quantity
4	4
5	12
6	162
7	46
8	35
9	6
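Grouping by serial-number length amounts to counting decimal digits. The Python sketch below illustrates the idea; the serial numbers are made up for the example and the function names are hypothetical, not taken from the report's R code.

```python
from collections import Counter


def serial_group(serial_number):
    """Group label = number of decimal digits in the serial number."""
    return len(str(serial_number))


def group_sizes(serial_numbers):
    """Count observations per digit-length group."""
    return Counter(serial_group(s) for s in serial_numbers)


# Made-up serial numbers, for illustration only.
example_ids = [1234, 12345, 123456, 1234567, 12345678, 87654321, 123456789]

sizes = group_sizes(example_ids)
group_8 = [s for s in example_ids if serial_group(s) == 8]

print(sorted(sizes.items()))
print(group_8)
```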

Characteristics for Groups 4 and 9 don't match other groups

The group number indicates the number of digits in the serial number, which varies from four to nine. Group-4 and Group-9 are combined into a single group, Group-49. Groups 4 and 9 were deleted from this study because Group-49 has a steeper fitted line than Group-8 in Plot-1 below.

plot of chunk plotgroup2

Plot-2 shows the data is more uniform with Group-49 deleted. Notice that Group-8 in brown seems to represent typical members with this change. For the remainder of this study, we focus on Group-8.

Exploration of Group-8

Separating Benign (B) and Malignant (M) Clusters

Method-1: k-means

A plot of the worst concave points versus mean compactness appears to separate the malignant and benign data points into two loose clusters in Plot-3 below. The centers of the benign (B) and malignant (M) clusters were determined using the kmeans function and are indicated by crosses in the figure. To help predict the diagnosis, the assigned cluster group is added to the raw data frame. The fitted lines for the clusters are distinct from one another, with the line through the malignant points higher in the plot. This indicates that malignant tumors have greater worst concave points values for a given degree of tumor compactness.
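To make the clustering step concrete, here is a minimal two-cluster k-means sketch in Python. It uses the plain Lloyd iteration, which may differ in detail from the default algorithm of R's kmeans function used in the report. The data points and the starting centers are invented for illustration.

```python
def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(len(centers)),
                key=lambda k: (p[0] - centers[k][0]) ** 2
                              + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[k])  # leave an empty cluster in place
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Invented (mean compactness, worst concave points) pairs, not real data.
points = [(0.05, 0.05), (0.06, 0.07), (0.08, 0.06), (0.07, 0.08),
          (0.15, 0.20), (0.18, 0.22), (0.20, 0.25), (0.16, 0.18)]

centers, labels = kmeans(points, centers=[points[0], points[-1]])
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The returned labels play the role of the cluster-group column that the report adds to the data frame.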

Method-2: Adjustment of Prediction with Improvised Rule

A plot of worst concavity versus mean perimeter also divides the data into two groups, but this time the separation does not lend itself to k-means clusters of benign and malignant points: the perimeter is a regressor for concavity, and the range of the mean perimeter of malignant tumors covers some of the range for benign tumors. Instead, we can devise improvised rules that isolate the benign tumors at the lower left corner of Plot-4, as follows.

Rule-1: Assume tumors are benign if the mean perimeter is below 100 mm and the worst concavity is within 0.05 of the regression line or lower.

Rule-2: Otherwise assume tumors are malignant.
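The two rules can be sketched as a small Python function. This reads Rule-1 as a cutoff of 100 on the mean perimeter combined with a band of 0.05 above the regression line of worst concavity on mean perimeter; that reading, the parameter names, and the slope and intercept values are all assumptions made for illustration, since the report does not give the fitted line.

```python
def predict_by_rule(mean_perimeter, worst_concavity, slope, intercept,
                    perimeter_cutoff=100.0, margin=0.05):
    """Return 'B' for points in the lower-left benign region, else 'M'."""
    fitted = intercept + slope * mean_perimeter
    at_or_below_band = worst_concavity <= fitted + margin
    if mean_perimeter < perimeter_cutoff and at_or_below_band:
        return "B"  # Rule-1
    return "M"      # Rule-2


# Hypothetical regression line, chosen only to make the example concrete.
SLOPE, INTERCEPT = 0.004, -0.2

print(predict_by_rule(80.0, 0.10, SLOPE, INTERCEPT))   # B
print(predict_by_rule(80.0, 0.40, SLOPE, INTERCEPT))   # M
print(predict_by_rule(120.0, 0.10, SLOPE, INTERCEPT))  # M
```

In this sketch, `margin` and `perimeter_cutoff` are the adjustable knobs: shrinking the benign region labels more tumors malignant, which raises detection power and also the number of false alarms.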

plot of chunk plotgroup4

Adjusting Detection Power versus False Alarms (ROC Curve) with Improvised Rules

The improvised rule is defined by the blue lines in Plot-4. It helps predict whether tumors are benign or malignant, but it also provides a way to adjust the automated predictions provided by the caret package in order to raise the power of detecting existing malignant tumors at the cost of producing more false alarms. There is an inherent tradeoff between detection power and the number of false alarms raised when tumors are not malignant. Favoring detection generally lowers the overall accuracy, but our goal is to optimize the overall cost of missed detections and false alarms instead of optimizing an overall accuracy figure.

Regression Analysis Help

If the data for benign and malignant cases appear mixed together or cover the same range, we might still be able to separate them using regression. In Plot-5, the worst perimeter measurements for the two groups overlap; but after regression against the mean tumor radius is applied in Plot-6, benign and malignant cases separate into distinct groups.
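The effect described here can be reproduced on a toy example: the raw values of two groups overlap, yet their residuals from a common regression line do not. The Python sketch below uses ordinary least squares with invented numbers; it is not the report's data or its R code.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Vertical distance of each point from the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Invented values: malignant points sit above the common trend.
mean_radius = [10, 12, 14, 16, 11, 13, 15, 17]
worst_perimeter = [60, 72, 84, 96, 76, 88, 100, 112]
diagnosis = ["B", "B", "B", "B", "M", "M", "M", "M"]

res = residuals(mean_radius, worst_perimeter)
benign_res = [r for r, d in zip(res, diagnosis) if d == "B"]
malignant_res = [r for r, d in zip(res, diagnosis) if d == "M"]

# Raw worst perimeters overlap (60-96 versus 76-112), but residuals do not.
print(max(benign_res) < 0 < min(malignant_res))  # True
```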

plot of chunk misc_plots

Prediction Model

After an initial attempt to build a prediction model for malignant tumors with all the data, we noticed subgroups with different characteristics and therefore built a model based on the 70 rows of Group-8 to explore what can be learned. The other groups should also be analyzed and compared with Group-8. The model-building process began with the mean variables reported in Columns 3-12 and the worst (largest) values reported in the third group in Columns 23-32. The standard errors reported in Columns 13-22 could be considered in a further inquiry.

The data was divided into model building and validation sections. The build section was then further divided using a two-fold partition for training and model tryout testing. The random forest method was selected.
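The partitioning can be sketched as follows in Python. The report does not state the exact procedure, so the 70% build fraction (which reproduces the 49/21 split of the 70 Group-8 rows), the seed, and the alternating fold assignment are assumptions, and the function names are hypothetical.

```python
import random


def split_rows(row_ids, build_fraction=0.7, seed=0):
    """Shuffle row ids and split them into build and validation sections."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_build = round(build_fraction * len(shuffled))
    return shuffled[:n_build], shuffled[n_build:]


def two_folds(build_rows):
    """Alternate rows between two folds of near-equal size."""
    return build_rows[0::2], build_rows[1::2]


build, validation = split_rows(range(70))
train_fold, tryout_fold = two_folds(build)

print(len(build), len(validation))        # 49 21
print(len(train_fold), len(tryout_fold))  # 25 24
```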

Data fields with more than 90% correlation with the radius_mean parameter were eliminated from the model. All of the data fields were separately tested for their accuracy in predicting the benign or malignant diagnosis in Column-2. Fields with sensitivity or specificity below 60% were eliminated from the model.
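A correlation filter of this kind can be sketched in Python as below. It assumes the Pearson coefficient and compares its absolute value with 0.90; whether the report used the absolute value is not stated. The column values are invented, and `perimeter_mean` and `texture_mean` are assumed column names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, reference="radius_mean", threshold=0.90):
    """Keep the reference column and columns with |r| <= threshold against it."""
    ref = columns[reference]
    return [name for name, values in columns.items()
            if name == reference or abs(pearson(ref, values)) <= threshold]


# Invented values: perimeter tracks radius exactly, texture does not.
columns = {
    "radius_mean": [10, 12, 14, 16, 18],
    "perimeter_mean": [65, 78, 91, 104, 117],
    "texture_mean": [20, 15, 22, 17, 19],
}

print(drop_correlated(columns))  # ['radius_mean', 'texture_mean']
```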

Training Test Results

Table-2: Results for Test Set
test_predictions	Actual_Diagnosis	Testing_Accuracy
M	M	TRUE
M	M	TRUE
B	B	TRUE
B	B	TRUE
M	M	TRUE
B	B	TRUE
B	B	TRUE
B	B	TRUE
M	M	TRUE
B	B	TRUE
B	B	TRUE
B	B	TRUE
B	B	TRUE
M	M	TRUE
B	B	TRUE
B	B	TRUE
B	B	TRUE
B	M	FALSE

Performance

Detection Power versus False Alarms (ROC Curve)

There is an inherent tradeoff between detection power and the number of false alarms. This curve was generated with the R packages caret, pROC, and plotROC from the output of the train function for the prediction model. The “pred” list item of the output contains the data frame for the curve.

Sensitivity: The probability of a positive test result if the disease is present

Specificity: The probability of a negative test result if the disease is not present

Probability of False Alarm = 1 - Specificity
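As a worked example, these three quantities can be computed from the 18 prediction/diagnosis pairs of Table-2. The Python sketch below copies those pairs and treats malignant (M) as the positive class; the function name is hypothetical.

```python
# (prediction, actual diagnosis) pairs copied from Table-2, in order.
results = [
    ("M", "M"), ("M", "M"), ("B", "B"), ("B", "B"), ("M", "M"), ("B", "B"),
    ("B", "B"), ("B", "B"), ("M", "M"), ("B", "B"), ("B", "B"), ("B", "B"),
    ("B", "B"), ("M", "M"), ("B", "B"), ("B", "B"), ("B", "B"), ("B", "M"),
]


def detection_rates(pairs, positive="M"):
    """Return (sensitivity, specificity, false-alarm probability)."""
    tp = sum(1 for pred, act in pairs if act == positive and pred == positive)
    fn = sum(1 for pred, act in pairs if act == positive and pred != positive)
    tn = sum(1 for pred, act in pairs if act != positive and pred != positive)
    fp = sum(1 for pred, act in pairs if act != positive and pred == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, 1 - specificity


sensitivity, specificity, false_alarm = detection_rates(results)
print(round(sensitivity, 2), specificity, false_alarm)  # 0.83 1.0 0.0
```

Five of the six malignant cases are detected and none of the twelve benign cases raises a false alarm.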

Rather than overall accuracy, we often need to focus on the costs associated with sensitivity and specificity independently. For example, in the case of a defensive radar system, the specificity determines the probability of shooting down your own plane and injuring a friendly pilot. It would be a mistake to focus on an overall accuracy that mixes sensitivity and specificity into a single parameter when failing to detect a condition (the complement of sensitivity) and false alerts (the complement of specificity) each carry their own costs. Similarly, in the case of diagnosing cancer, the cost of failing to detect it and the cost of false alarms should be considered separately.
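The point about separate costs can be illustrated by sweeping a decision threshold over classifier scores and picking the threshold with the lowest total cost. In this Python sketch the scores, labels, and cost values are invented; it is not the procedure used by the R packages named in this report.

```python
def roc_points(scores, labels):
    """Return (threshold, false-alarm rate, sensitivity) per candidate threshold."""
    positives = sum(1 for lab in labels if lab == "M")
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, lab in zip(scores, labels) if s >= threshold and lab == "M")
        fp = sum(1 for s, lab in zip(scores, labels) if s >= threshold and lab == "B")
        points.append((threshold, fp / negatives, tp / positives))
    return points


def total_cost(threshold, scores, labels, cost_miss, cost_false_alarm):
    """Cost of all missed malignant cases plus all false alarms."""
    misses = sum(1 for s, lab in zip(scores, labels) if s < threshold and lab == "M")
    alarms = sum(1 for s, lab in zip(scores, labels) if s >= threshold and lab == "B")
    return cost_miss * misses + cost_false_alarm * alarms


def cheapest_threshold(scores, labels, cost_miss, cost_false_alarm):
    """Threshold with the lowest total cost on the given data."""
    return min(sorted(set(scores), reverse=True),
               key=lambda t: total_cost(t, scores, labels, cost_miss, cost_false_alarm))


# Invented malignancy scores and true diagnoses.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = ["M", "M", "B", "M", "B", "M", "B", "B"]

print(cheapest_threshold(scores, labels, cost_miss=10, cost_false_alarm=1))  # 0.3
print(cheapest_threshold(scores, labels, cost_miss=1, cost_false_alarm=10))  # 0.85
```

When a missed detection is ten times as costly as a false alarm, the cheapest operating point moves to a low threshold that catches every malignant case; reversing the costs moves it to a high threshold with no false alarms.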

To help control our model, two columns were added to the data frame: the k-means cluster identification for worst concave points, as shown in Plot-3 for the training set, and the worst concavity classification according to the blue lines in Plot-4. A further enhancement might include the worst perimeter analysis from Plot-6. Adding these extra parameters should help adjust the ROC curve, raising the detection power for malignant tumors at the cost of producing more false alarms.

plot of chunk roc_curve

Simplified ROC Curve for Worst Concave Points Test Set Results

This characteristic curve is based only on the worst concave points of the test set.

plot of chunk trainingroc

Confusion Matrix

The table below describes the results for the test set.

Table-3: Prediction Accuracy for Test Set
Parameter	Test	Diagnosis	Counts	Correct_Detections	Accuracy
Detection Power	Sensitivity	Malignant	6	5	0.83
No False Alarm	Specificity	Benign	12	12	1.00

Code for this reproducible report is available at https://github.com/marcelMerchat/cancer.