Data Introduction

Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound and biopsy are commonly used to diagnose breast cancer.

In this paper we discuss a diagnostic technique based on fine needle aspiration (FNA), a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst with a fine needle, similar to the needle used for a blood sample.

This is an analysis of the Breast Cancer Wisconsin (Diagnostic) Data Set, obtained from Kaggle. The data set was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA. To create the dataset, Dr. Wolberg used fluid samples taken from patients with solid breast masses and an easy-to-use graphical computer program called Xcyt, which is capable of analysing cytological features based on a digital scan. The program uses a curve-fitting algorithm to compute ten features from each of the cells in the sample, then calculates the mean value, extreme value and standard error of each feature for the image, returning a 30-dimensional real-valued vector.

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
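The field numbering described above can be reproduced with a short R sketch. It assumes the columns follow the `radius_mean`, `radius_se`, `radius_worst` naming pattern of the data file; the object names `features`, `stats`, `measure_names` and `fields` are introduced here purely for illustration.

```r
# Ten base features, in the order listed above.
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")

# Three statistics computed per image for each feature.
stats <- c("mean", "se", "worst")

# outer() pairs every feature with every statistic; as.vector() reads the
# resulting 10 x 3 matrix column by column, so all "_mean" names come first,
# then all "_se" names, then all "_worst" names.
measure_names <- as.vector(outer(features, stats, paste, sep = "_"))

# Fields 1 and 2 are the ID and the diagnosis; fields 3-32 are the measures.
fields <- c("id", "diagnosis", measure_names)

# Look up the three fields quoted in the text.
fields[c(3, 13, 23)]
```

Because each block of ten names is offset by ten positions, any feature's standard error and worst value sit 10 and 20 fields after its mean.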

This analysis aims to identify which features are most helpful in predicting whether a tumor is malignant or benign, and to spot general trends that may aid model selection and hyperparameter selection in future analysis.

Descriptive statistics

The first step is to visually inspect the new data set.

## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...

This exploration will focus on 11 of the 32 variables presented in this dataset (the 33rd column, X, contains only missing values and is ignored): diagnosis (“M”/“B”), radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave.points_mean, symmetry_mean and fractal_dimension_mean.

##   diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1         M       17.99        10.38         122.80    1001.0
## 2         M       20.57        17.77         132.90    1326.0
## 3         M       19.69        21.25         130.00    1203.0
## 4         M       11.42        20.38          77.58     386.1
## 5         M       20.29        14.34         135.10    1297.0
## 6         M       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean
## 1        0.2419                0.07871
## 2        0.1812                0.05667
## 3        0.2069                0.05999
## 4        0.2597                0.09744
## 5        0.1809                0.05883
## 6        0.2087                0.07613
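One way to obtain such a subset is sketched below. It is only an illustration: the report does not show its own selection code, the small data frame is transcribed by hand from the first three rows of the output above, and the names `df`, `keep` and `subset_df` are hypothetical.

```r
# Three rows transcribed from the output above. In practice the full table
# would be read from the downloaded CSV file instead.
df <- data.frame(
  id           = c(842302L, 842517L, 84300903L),
  diagnosis    = factor(c("M", "M", "M"), levels = c("B", "M")),
  radius_mean  = c(17.99, 20.57, 19.69),
  texture_mean = c(10.38, 17.77, 21.25),
  radius_se    = c(1.095, 0.543, 0.746),
  radius_worst = c(25.4, 25.0, 23.6),
  X            = NA  # empty trailing column, as in the structure output
)

# Keep the diagnosis plus every column whose name ends in "_mean".
keep <- c("diagnosis", grep("_mean$", names(df), value = TRUE))
subset_df <- df[, keep]

names(subset_df)
```

Selecting by name pattern rather than by position keeps the code valid if the column order of the file changes.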

Let’s check for missing values:

##              diagnosis            radius_mean           texture_mean 
##                      0                      0                      0 
##         perimeter_mean              area_mean        smoothness_mean 
##                      0                      0                      0 
##       compactness_mean         concavity_mean    concave.points_mean 
##                      0                      0                      0 
##          symmetry_mean fractal_dimension_mean 
##                      0                      0

Missing values: none
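A per-column count like the one above can be produced with `colSums(is.na(...))`. The sketch below is an assumption about how such a check might be written, not the report's own code; the data frame holds three hand-transcribed rows, and the empty column `X` is kept deliberately to show what a fully missing column looks like.

```r
# Three rows transcribed from the structure output shown earlier.
df <- data.frame(
  diagnosis   = factor(c("M", "M", "M"), levels = c("B", "M")),
  radius_mean = c(17.99, 20.57, 19.69),
  area_mean   = c(1001, 1326, 1203),
  X           = NA  # column that is entirely missing
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(df))
na_counts

# Check the analysed columns only, leaving out the known-empty X.
any_missing <- any(na_counts[names(na_counts) != "X"] > 0)
any_missing
```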

Now that we have a good intuitive sense of the data, the next step is to take a closer look at the attributes and their values.

##  diagnosis  radius_mean      texture_mean   perimeter_mean  
##  B:357     Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  M:212     1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##            Median :13.370   Median :18.84   Median : 86.24  
##            Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##            3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##            Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##  concave.points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.20120     Max.   :0.3040   Max.   :0.09744

Description

In the results displayed, you can see the data has 569 records, each with 11 columns.

Diagnosis is a categorical variable.

All feature values are recorded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

Univariate Plots Section

One of the main goals of visualising the data here is to identify which features are most helpful in predicting whether a tumor is malignant or benign. The other is to spot general trends that may aid model selection and hyperparameter selection.

We apply three techniques to understand each attribute of the dataset independently: histograms, density plots, and box-and-whisker plots.

Frequency of cancer diagnosis

First, let's look at the frequency of each diagnosis: M = malignant (indicates the presence of cancer cells) and B = benign (indicates their absence). Of the 569 observations, 357 (62.7%) indicate the absence of cancer cells and 212 (37.3%) show the presence of cancerous cells.
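The percentages follow directly from the class counts. A minimal R check, using only the two counts reported in this section (the names `counts` and `percent` are illustrative):

```r
# Class counts as reported in the summary output.
counts <- c(B = 357, M = 212)

# prop.table() divides each count by the total; scale to percent and round.
percent <- round(100 * prop.table(counts), 1)
percent
```

On the full data frame, the same counts would come from `table()` applied to the diagnosis column.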

Visualise distribution of data via histograms

We can see that the concavity and concave.points attributes may have an exponential distribution. The texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution for the input variables.

feature	skewness
radius_mean	(0.94)
texture_mean	(0.65)
perimeter_mean	(0.99)
area_mean	(1.64)
smoothness_mean	(0.45)
compactness_mean	(1.18)
concavity_mean	(1.39)
concave.points_mean	(1.17)
symmetry_mean	(0.72)
fractal_dimension_mean	(1.3)

The skewness results show a positive (right) skew for all features. Values closer to zero indicate less skew.

feature	kurtosis
radius_mean	(0.81)
texture_mean	(0.73)
perimeter_mean	(0.94)
area_mean	(3.59)
smoothness_mean	(0.82)
compactness_mean	(1.61)
concavity_mean	(1.95)
concave.points_mean	(1.03)
symmetry_mean	(1.25)
fractal_dimension_mean	(2.95)
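The report does not state which package or estimator produced the skewness and kurtosis tables above. As a rough sketch, the moment-based definitions can be written in base R as follows. This assumes the kurtosis table reports excess kurtosis (kurtosis minus 3, so a Gaussian scores 0); other estimators apply small-sample corrections and give slightly different numbers. The function names and the sample vector are hypothetical.

```r
# Moment-based sample skewness: third central moment over sd^3.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical right-skewed sample: one large value stretches the right tail.
x <- c(1, 2, 2, 3, 3, 3, 4, 10)
skewness(x)         # positive, i.e. right-skewed
excess_kurtosis(x)  # tail weight relative to a Gaussian
```

Applied to a data frame of the ten mean features, `sapply(data, skewness)` would give one value per column, in the same layout as the tables above.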

Visualise distribution of data via histograms grouped by diagnosis

Comparing the distributions by diagnosis shows that larger values of radius_mean, perimeter_mean, area_mean, concavity_mean and concave.points_mean tend to be associated with malignant tumors.

Visualise distribution of data via density plots

We can see that the perimeter, radius, area, concavity and compactness attributes may have an exponential distribution. The texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution.

Visualise distribution of data via box plots

Looking at the boxplots, area_mean and fractal_dimension_mean appear to have a large number of outliers, defined as values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Here these outliers are a group of measurements that are much higher than the average.

Let’s see if it warrants further cleanup.

Outlier check for area_mean:

## Outliers identified: 25  Propotion (%) of outliers: 4.6  Mean of the outliers: 1670.52  Mean without removing outliers: 654.89  Mean if we remove outliers: 608.21

The proportion of outliers is below five percent (4.6%), so no further cleanup is required.

Outlier check for fractal_dimension_mean:

## Outliers identified: 15  Propotion (%) of outliers: 2.7  Mean of the outliers: 0.09  Mean without removing outliers: 0.06  Mean if we remove outliers: 0.06

The proportion of outliers is below five percent (2.7%) and removing them has no effect on the mean, so no further cleanup is required.
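The outlier checks above can be approximated with the sketch below, which applies the 1.5 × IQR rule using `quantile()`. It is not the function used in this report, and the name `outlier_summary` is hypothetical. One assumption is worth flagging: the reported 4.6% matches 25 outliers divided by the 544 remaining observations (25 / 569 would give 4.4%), so the sketch computes the proportion relative to the non-outlier values.

```r
outlier_summary <- function(x) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, c(0.25, 0.75)))  # first and third quartiles
  iqr <- q[2] - q[1]
  # Flag values more than 1.5 * IQR below Q1 or above Q3.
  is_out <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
  list(
    n_outliers     = sum(is_out),
    # Assumption: proportion relative to the remaining (non-outlier) values.
    proportion_pct = round(100 * sum(is_out) / sum(!is_out), 1),
    mean_outliers  = mean(x[is_out]),
    mean_with      = mean(x),           # mean without removing outliers
    mean_without   = mean(x[!is_out])   # mean if outliers are removed
  )
}

# Hypothetical toy vector with one obvious outlier.
toy <- c(1:10, 100)
res <- outlier_summary(toy)
res
```

Note that `quantile()` and `boxplot.stats()` define the quartiles slightly differently, so the counts may differ by a point or two from those of a boxplot on some data.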

Visualise distribution of data via box plots grouped by diagnosis

Boxplots are another way of showing the attributes. They let us compare the median values for benign and malignant cases; for example, for area_mean the median is about 500 for benign cases and about 1000 for malignant ones. The dots mark outliers, that is, exceptional cases. The main observation is that the median value of various variables is much higher in malignant cases.
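The per-group medians that a grouped boxplot displays can also be computed directly with `tapply()`. The values below are hypothetical toy numbers chosen only to echo the approximate medians quoted above; they are not taken from the dataset.

```r
# Hypothetical toy data: three benign and three malignant cases.
toy <- data.frame(
  diagnosis = factor(c("B", "B", "B", "M", "M", "M")),
  area_mean = c(400, 500, 600, 900, 1000, 1100)
)

# Median of area_mean within each diagnosis group.
medians <- tapply(toy$area_mean, toy$diagnosis, median)
medians

# The matching grouped boxplot would be drawn with:
# boxplot(area_mean ~ diagnosis, data = toy)
```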

Bivariate/Multivariate Analysis

Correlation Plot

We are also interested in how the 10 predictors relate to each other. To see bivariate relationships among these 10 predictors, we will look at correlations and scatterplots.
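As a minimal sketch of this step, the correlation matrix can be computed with `cor()`. The six rows below are the ones printed earlier in this report for four of the predictors. Six observations are far too few for reliable estimates, so this only illustrates the mechanics; the object names `means` and `corr` are illustrative.

```r
# First six observations of four predictors, as printed earlier.
means <- data.frame(
  radius_mean     = c(17.99, 20.57, 19.69, 11.42, 20.29, 12.45),
  perimeter_mean  = c(122.80, 132.90, 130.00, 77.58, 135.10, 82.57),
  area_mean       = c(1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1),
  smoothness_mean = c(0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780)
)

# Pearson correlation between every pair of columns.
corr <- cor(means)
round(corr, 2)

# The matching scatterplot matrix would be drawn with:
# pairs(means)
```

Radius, perimeter and area are geometrically related, so strong correlations among them are expected and matter when choosing predictors for a model.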

MLproject

Kalshtein Yael

23 November 2017