Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammography, ultrasound and biopsy are commonly used to diagnose breast cancer.
In this paper we discuss a diagnostic technique based on fine needle aspiration (FNA), a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst with a fine needle similar to the one used to draw a blood sample.
This is an analysis of the Breast Cancer Wisconsin (Diagnostic) Data Set, obtained from Kaggle. The data set was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA. To create it, Dr. Wolberg used fluid samples taken from patients with solid breast masses and an easy-to-use graphical computer program called Xcyt, which is capable of analyzing cytological features from a digital scan. The program uses a curve-fitting algorithm to compute ten features for each of the cells in the sample; it then calculates the mean value, extreme value and standard error of each feature for the image, returning a 30-element real-valued vector.
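Attribute Information:
Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.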
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
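The aggregation and field layout described above can be sketched in a few lines of Python (used here purely for illustration; this is not the code behind the analysis). The per-nucleus radii are made-up values, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei, since the source does not say how Xcyt computes it.

```python
import math
import statistics

# The ten per-nucleus features, named after the columns of the data set.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave.points", "symmetry",
    "fractal_dimension",
]

def summarize(values):
    """Return (mean, standard error, worst) for one feature of one image."""
    mean = statistics.fmean(values)
    # Assumption: SE = sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst" is the mean of the three largest values.
    worst = statistics.fmean(sorted(values)[-3:])
    return mean, se, worst

def field_names():
    """Column layout: id, diagnosis, then 10 means, 10 SEs, 10 worst values."""
    names = ["id", "diagnosis"]
    for suffix in ("mean", "se", "worst"):
        names += [f"{feature}_{suffix}" for feature in FEATURES]
    return names

# Hypothetical radii measured for the nuclei found in a single image.
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, se, worst = summarize(radii)
names = field_names()
print(mean, round(se, 3), worst)
print(names[2], names[12], names[22])  # fields 3, 13 and 23 (1-based)
```

Counting fields from 1, the three radius statistics land on positions 3, 13 and 23, which matches the example given in the text.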
This analysis aims to identify which features are most helpful in predicting whether a tumor is malignant or benign, and to spot general trends that may aid model selection and hyperparameter selection in future analysis.
The first step is to visually inspect the new data set.
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
This exploration focuses on 11 of the 32 variables in this data set (the 33rd column in the listing above, X, shows only NA values): diagnosis (“M”/“B”), radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave.points_mean, symmetry_mean and fractal_dimension_mean.
## diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 M 17.99 10.38 122.80 1001.0
## 2 M 20.57 17.77 132.90 1326.0
## 3 M 19.69 21.25 130.00 1203.0
## 4 M 11.42 20.38 77.58 386.1
## 5 M 20.29 14.34 135.10 1297.0
## 6 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean
## 1 0.2419 0.07871
## 2 0.1812 0.05667
## 3 0.2069 0.05999
## 4 0.2597 0.09744
## 5 0.1809 0.05883
## 6 0.2087 0.07613
Let’s check for missing values:
## diagnosis radius_mean texture_mean
## 0 0 0
## perimeter_mean area_mean smoothness_mean
## 0 0 0
## compactness_mean concavity_mean concave.points_mean
## 0 0 0
## symmetry_mean fractal_dimension_mean
## 0 0
Missing values: none
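A per-column count of missing entries like the one above could be produced along these lines. This is a minimal Python sketch on hypothetical rows, not the code used for the report; the helper name `count_missing` and the set of markers treated as missing (`None`, NaN, empty strings, `"NA"`) are assumptions.

```python
import math

def count_missing(rows, columns):
    """Count entries per column that are None, NaN, empty or the string 'NA'."""
    def is_missing(value):
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        return isinstance(value, str) and value.strip() in ("", "NA")

    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}

# Hypothetical rows; the second one lacks a texture value on purpose.
rows = [
    {"diagnosis": "M", "radius_mean": 17.99, "texture_mean": 10.38},
    {"diagnosis": "B", "radius_mean": 13.54, "texture_mean": None},
    {"diagnosis": "B", "radius_mean": 12.45, "texture_mean": 15.70},
]
missing = count_missing(rows, ["diagnosis", "radius_mean", "texture_mean"])
print(missing)
```

A result of zero for every column, as reported above, means no imputation or row removal is needed before plotting.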
Now that we have a good intuitive sense of the data, the next step is to take a closer look at the attributes and their values.
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.20120 Max. :0.3040 Max. :0.09744
In the results displayed, you can see the data has 569 records, each with 11 columns.
Diagnosis is a categorical variable.
All feature values are recorded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
One of the main goals of visualizing the data here is to observe which features are most helpful in predicting whether a tumor is malignant or benign. The other is to see general trends that may aid us in model selection and hyperparameter selection.
We apply three techniques to understand each attribute of the data set independently: histograms, density plots, and box-and-whisker plots.
First, let's get the frequency of each diagnosis.
M = malignant (indicates the presence of cancer cells); B = benign (indicates their absence). Of the 569 observations, 357 (62.7%) indicate the absence of cancer cells and 212 (37.3%) show the presence of cancerous cells.
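The class percentages follow directly from the counts in the summary. As a quick check, here is a small Python sketch that rebuilds the label column from the reported class counts (357 benign, 212 malignant) rather than reading the actual file:

```python
from collections import Counter

# Rebuild the label column from the class counts reported in the summary.
diagnosis = ["B"] * 357 + ["M"] * 212

counts = Counter(diagnosis)
total = sum(counts.values())
# Share of each class as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, counts["B"], counts["M"], shares)
```

The classes are moderately imbalanced, which is worth keeping in mind when accuracy is later used to compare models.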
We can see that the concavity and concave points attributes may have an exponential distribution, while the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.
| feature | skewness |
|---|---|
| radius_mean | 0.94 |
| texture_mean | 0.65 |
| perimeter_mean | 0.99 |
| area_mean | 1.64 |
| smoothness_mean | 0.45 |
| compactness_mean | 1.18 |
| concavity_mean | 1.39 |
| concave.points_mean | 1.17 |
| symmetry_mean | 0.72 |
| fractal_dimension_mean | 1.30 |
The skewness results show a positive (right) skew for all features. Values closer to zero indicate less skew.
| feature | kurtosis |
|---|---|
| radius_mean | 0.81 |
| texture_mean | 0.73 |
| perimeter_mean | 0.94 |
| area_mean | 3.59 |
| smoothness_mean | 0.82 |
| compactness_mean | 1.61 |
| concavity_mean | 1.95 |
| concave.points_mean | 1.03 |
| symmetry_mean | 1.25 |
| fractal_dimension_mean | 2.95 |
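For readers who want to reproduce figures of this kind, the sketch below computes moment-based skewness and excess kurtosis in plain Python. The source does not state which estimator produced the tables, so this is only one common definition (no small-sample correction, kurtosis reported relative to the Gaussian value of 3); the sample lists are made up.

```python
import statistics

def central_moment(values, k):
    """k-th central moment, dividing by n (no small-sample correction)."""
    mean = statistics.fmean(values)
    return statistics.fmean([(x - mean) ** k for x in values])

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5. Positive means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Gaussian sample scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]

print(round(skewness(symmetric), 3), round(excess_kurtosis(symmetric), 3))
print(round(skewness(right_skewed), 3), round(excess_kurtosis(right_skewed), 3))
```

Under this definition, large positive kurtosis (as for area_mean and fractal_dimension_mean) points to heavier tails than a Gaussian, which fits the outliers seen in the boxplots.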
Comparing the distributions by diagnosis shows that larger values of radius_mean, perimeter_mean, area_mean, concavity_mean and concave.points_mean tend to be associated with malignant tumors.
We can see that the perimeter, radius, area, concavity and compactness attributes may have an exponential distribution, while the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution.
Looking at the boxplots, area_mean and fractal_dimension_mean appear to have a large number of outlier points, defined as values lying more than 1.5 times the interquartile range (IQR) above the third quartile. These outliers represent a group of measurements that are much higher than the average.
Let’s see whether they warrant further cleanup.
Outlier check for area_mean:
## Outliers identified: 25
## Proportion (%) of outliers: 4.6
## Mean of the outliers: 1670.52
## Mean without removing outliers: 654.89
## Mean if we remove outliers: 608.21
The proportion of outliers is less than five percent (4.6%), so no further cleanup is required.
Outlier check for fractal_dimension_mean:
## Outliers identified: 15
## Proportion (%) of outliers: 2.7
## Mean of the outliers: 0.09
## Mean without removing outliers: 0.06
## Mean if we remove outliers: 0.06
The proportion of outliers is less than five percent (2.7%) and removing them has no visible effect on the mean, so no further cleanup is required.
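An outlier check of this kind can be sketched with the 1.5 × IQR rule as follows. This is an illustrative Python version on made-up area values, not the function that produced the output above; the helper name `iqr_outliers` is hypothetical, and quartile estimates differ slightly between tools, so counts on the real data may not match exactly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Split values into (outliers, kept) using the k * IQR fences."""
    # "inclusive" interpolates between data points; other quartile
    # conventions exist and can move the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < low or x > high]
    kept = [x for x in values if low <= x <= high]
    return outliers, kept

# Hypothetical area values with one obviously extreme measurement.
areas = [400.0, 450.0, 500.0, 550.0, 600.0, 650.0, 700.0, 2500.0]
outliers, kept = iqr_outliers(areas)
share = 100 * len(outliers) / len(areas)

print(outliers, round(share, 1))
print(round(statistics.fmean(areas), 2), round(statistics.fmean(kept), 2))
```

The sketch reports outliers as a share of all observations. The percentages quoted above are consistent with dividing by the number of remaining observations instead (25/544 ≈ 4.6% and 15/554 ≈ 2.7%); either way both features stay below the five percent threshold.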
Boxplots are another way of showing the various attributes. They let us compare the median values for benign and malignant cases. For example, for area_mean the median is about 500 for benign cases and about 1000 for malignant ones. The dots mark outliers, that is, exceptional cases. The main observation is that the median of various variables is much higher in malignant cases.
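The per-class medians that the boxplots display can also be computed directly. The following Python sketch groups a handful of hypothetical records by diagnosis; the values are invented for illustration and the helper name `median_by_group` is an assumption.

```python
import statistics
from collections import defaultdict

def median_by_group(records, group_key, value_key):
    """Median of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {label: statistics.median(vals) for label, vals in groups.items()}

# Hypothetical records, three per class.
records = [
    {"diagnosis": "M", "area_mean": 1001.0},
    {"diagnosis": "M", "area_mean": 1326.0},
    {"diagnosis": "M", "area_mean": 1203.0},
    {"diagnosis": "B", "area_mean": 480.0},
    {"diagnosis": "B", "area_mean": 520.0},
    {"diagnosis": "B", "area_mean": 566.0},
]
medians = median_by_group(records, "diagnosis", "area_mean")
print(medians)
```

Medians are preferred over means here because they are not pulled upward by the outliers discussed earlier.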
We are also interested in how the 10 predictors relate to each other. To see bivariate relationships among these 10 predictors, we will look at correlations and scatterplots.
Looking at the data, the first thing we notice is that it contains a number of correlated predictors. Strong positive relationships (r between 0.75 and 1) exist among several of the mean-value parameters:
* The mean area of the cell nucleus has a strong positive correlation with the mean values of radius and perimeter.
* Some parameters are moderately positively correlated (r between 0.5 and 0.75), for example concavity and area, or concavity and perimeter.
* Likewise, we see some strong negative correlation between fractal_dimension and the mean values of radius, texture and perimeter.
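The correlation coefficients behind these statements are Pearson's r. As a minimal Python sketch, the code below computes r for radius_mean and area_mean using only the first six records printed earlier, so the value illustrates the calculation rather than the full-data result. The `strength` helper applies the 0.5 and 0.75 cut-offs from the text; the label "weak" for anything below 0.5 is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Label |r| using the cut-offs from the text; 'weak' is an added label."""
    size = abs(r)
    if size >= 0.75:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "weak"

# radius_mean and area_mean of the first six records shown earlier.
radius = [17.99, 20.57, 19.69, 11.42, 20.29, 12.45]
area = [1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1]

r = pearson(radius, area)
print(round(r, 3), strength(r))
```

A very high r between radius and area is expected on geometric grounds, since area grows with the square of the radius. Such near-duplicate predictors carry overlapping information, which matters for models that are sensitive to collinearity.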
A matrix giving a visual representation of the relationships between the 6 most highly correlated variables:
1. radius_mean
2. perimeter_mean
3. area_mean
4. compactness_mean
5. concavity_mean
6. concave.points_mean
We can clearly distinguish malignant from benign observations. Most of the benign observations are centered in the lower left quadrant of each graph, while the malignant observations are centered in the upper right quadrant. Some pairs of variables also have an almost linear relationship.
Mean values of cell radius, perimeter, area, compactness, concavity and concave points can be used to classify the cancer. Larger values of these parameters tend to be associated with malignant tumors. Mean values of texture, smoothness, symmetry and fractal dimension do not show a particular preference for one diagnosis over the other. None of the histograms shows noticeably large outliers that would warrant further cleanup.