data <- read.csv("~/data.csv")
library(ggplot2)
dim(data)
## [1] 569  33
colnames(data)
##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave.points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave.points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave.points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
## [33] "X"
table(data$diagnosis)
## 
##   B   M 
## 357 212
data$diagnosis <- as.factor(data$diagnosis)
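
The features in this table live on very different scales (areas in the hundreds or thousands, fractal dimensions well below 1), so a bin width that suits one histogram may not suit another. Below is a minimal sketch of a helper that derives a width from the observed range. `suggest_binwidth` and its default of 30 bins are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: choose a histogram bin width so that the observed
# range of x is split into roughly `bins` equal-width bins.
suggest_binwidth <- function(x, bins = 30) {
  x <- x[is.finite(x)]            # ignore NA / Inf values
  width <- diff(range(x)) / bins  # equal-width bins across the observed range
  if (width == 0) 1 else width    # constant column: fall back to a width of 1
}

# Toy vectors standing in for a large-scale and a small-scale feature
area_like    <- c(150, 400, 900, 2500)
fractal_like <- c(0.05, 0.06, 0.08, 0.10)

suggest_binwidth(area_like)     # 2350 / 30
suggest_binwidth(fractal_like)  # 0.05 / 30
```

With the real table, the result could be passed on as `geom_histogram(binwidth = suggest_binwidth(data$area_mean))`.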

Nuclear Features

Radius: The radius of an individual nucleus is measured by averaging the length of the radial line segments defined by the centroid of the snake and the individual snake points.
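
The radius definition can be sketched on a toy contour. The function name `mean_radius`, the coordinate vectors, and the choice of taking the centroid as the mean of the snake points are assumptions for illustration; the contours themselves are not part of `data.csv`.

```r
# Mean length of the radial segments from the centroid to each boundary point.
# x, y: coordinates of the snake points (hypothetical inputs).
mean_radius <- function(x, y) {
  cx <- mean(x)   # centroid taken as the mean of the points (assumption)
  cy <- mean(y)
  mean(sqrt((x - cx)^2 + (y - cy)^2))
}

# Toy contour: 100 points evenly spaced on a circle of radius 3
theta <- seq(0, 2 * pi, length.out = 101)[-101]
mean_radius(3 * cos(theta), 3 * sin(theta))   # approximately 3
```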

ggplot(data, aes(x = radius_mean, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))

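# radius_se spans only a few units, so binwidth = 2 would leave about two bars;
# 30 equal-width bins show the shape of the distribution instead.
ggplot(data, aes(x = radius_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))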

ggplot(data, aes(x = radius_worst, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Perimeter: The total distance between the snake points constitutes the nuclear perimeter.
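
As a sketch of this definition, the perimeter of a closed contour is the sum of the distances between consecutive points, including the segment that joins the last point back to the first. `polygon_perimeter` and its inputs are hypothetical names used only for this example.

```r
# Perimeter of a closed contour given as ordered snake points (x, y).
polygon_perimeter <- function(x, y) {
  # Pair every point with its successor, wrapping the last point to the first
  x_next <- c(x[-1], x[1])
  y_next <- c(y[-1], y[1])
  sum(sqrt((x_next - x)^2 + (y_next - y)^2))
}

# Toy contour: a 3 x 4 rectangle
polygon_perimeter(c(0, 3, 3, 0), c(0, 0, 4, 4))   # 3 + 4 + 3 + 4 = 14
```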

ggplot(data, aes(x = perimeter_mean, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = perimeter_se, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = perimeter_worst, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Area: Nuclear area is measured simply by counting the number of pixels on the interior of the snake and adding one-half of the pixels in the perimeter.
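
The pixel-counting rule can be sketched on a small binary mask. The mask, the name `pixel_area`, and the 4-neighbour test used to decide which pixels count as perimeter are assumptions for illustration; the source only states the interior-plus-half-perimeter rule.

```r
# mask: logical matrix, TRUE for pixels that belong to the nucleus
# (a hypothetical segmentation result, not something stored in data.csv).
pixel_area <- function(mask) {
  n <- nrow(mask)
  m <- ncol(mask)
  padded <- matrix(FALSE, n + 2, m + 2)   # FALSE border around the image
  padded[2:(n + 1), 2:(m + 1)] <- mask
  up    <- padded[1:n,       2:(m + 1)]
  down  <- padded[3:(n + 2), 2:(m + 1)]
  left  <- padded[2:(n + 1), 1:m]
  right <- padded[2:(n + 1), 3:(m + 2)]
  interior  <- mask & up & down & left & right   # all four neighbours inside
  perimeter <- mask & !interior                  # inside, but touching the outside
  sum(interior) + 0.5 * sum(perimeter)
}

# Toy mask: a filled 4 x 4 square inside a 6 x 6 image
mask <- matrix(FALSE, 6, 6)
mask[2:5, 2:5] <- TRUE
pixel_area(mask)   # 4 interior pixels + 12 / 2 perimeter pixels = 10
```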

# The area features span hundreds to thousands of units, so binwidth = 2 would
# produce a very large number of sliver-thin bars; use 30 equal-width bins.
ggplot(data, aes(x = area_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = area_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = area_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Compactness: Perimeter and area are combined to give a measure of the compactness of the cell nuclei using the formula perimeter^2/area. This dimensionless number is minimized by a circular disk and increases with the irregularity of the boundary. However, this measure of shape also increases for elongated cell nuclei, which do not necessarily indicate an increased likelihood of malignancy. The feature is also biased upward for small cells because of the decreased accuracy imposed by digitization of the sample. We compensate for the fact that no single shape measurement seems to capture the idea of “irregular” by employing several different shape features.
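
A short worked example of the formula perimeter^2/area on idealized shapes shows both properties described above: the circle gives the minimum, and elongation alone raises the value. The function name `compactness` and the toy shapes are chosen for illustration.

```r
# Compactness as defined in the text: perimeter^2 / area (dimensionless)
compactness <- function(perimeter, area) perimeter^2 / area

r <- 5
compactness(2 * pi * r, pi * r^2)   # circle: 4 * pi, about 12.57, for any r
compactness(4 * 2, 2^2)             # 2 x 2 square: 16
compactness(2 * (1 + 8), 1 * 8)     # elongated 1 x 8 rectangle: 40.5
```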

# The compactness features span a range narrower than 2, so binwidth = 2 would
# collapse each histogram into one or two bars; use 30 equal-width bins.
ggplot(data, aes(x = compactness_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = compactness_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = compactness_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Smoothness: The smoothness of a nuclear contour is quantified by measuring the difference between the length of a radial line and the mean length of the lines surrounding it. This is similar to the curvature energy computation in the snakes.
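
One way to read this definition is sketched below: for each radial line, take the absolute difference between its length and the mean of its two immediate neighbours, then average over the contour. Treating "the lines surrounding it" as the two adjacent lines, and averaging over points, are assumptions; `local_radius_variation` is a hypothetical name.

```r
# r: radial lengths in order around the closed contour (hypothetical input).
local_radius_variation <- function(r) {
  n <- length(r)
  stopifnot(n >= 3)                 # need a neighbour on each side
  prev_r <- r[c(n, 1:(n - 1))]      # previous radial line, wrapping around
  next_r <- r[c(2:n, 1)]            # next radial line, wrapping around
  mean(abs(r - (prev_r + next_r) / 2))
}

local_radius_variation(rep(5, 12))        # perfectly smooth contour: 0
local_radius_variation(rep(c(4, 6), 6))   # alternating radii: 2
```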

# The smoothness features span a range far narrower than 2, so binwidth = 2
# would collapse each histogram into a single bar; use 30 equal-width bins.
ggplot(data, aes(x = smoothness_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = smoothness_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = smoothness_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Concavity: In a further attempt to capture shape information, we measure the number and severity of concavities or indentations in a cell nucleus. We draw chords between non-adjacent snake points and measure the extent to which the actual boundary of the nucleus lies on the inside of each chord (see Figure 4). This parameter is greatly affected by the length of these chords, as smaller chords better capture small concavities. We have chosen to emphasize small indentations, as larger shape irregularities are captured by other features.

# The concavity features span a range narrower than 2, so binwidth = 2 would
# collapse each histogram into one or two bars; use 30 equal-width bins.
ggplot(data, aes(x = concavity_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = concavity_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = concavity_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Concave Points: This feature is similar to Concavity but measures only the number, rather than the magnitude, of contour concavities.
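
To illustrate counting rather than measuring concavities, the sketch below counts reflex (inward-turning) vertices of a polygon. This is a simple stand-in technique, not the chord-based procedure described for Concavity; the function name and the assumption that the points are listed counter-clockwise are mine.

```r
# x, y: contour points listed counter-clockwise (assumption).
count_reflex_vertices <- function(x, y) {
  n <- length(x)
  stopifnot(n >= 3, length(y) == n)
  prev_i <- c(n, 1:(n - 1))
  next_i <- c(2:n, 1)
  # z-component of the cross product of the incoming and outgoing edges
  cross <- (x - x[prev_i]) * (y[next_i] - y) - (y - y[prev_i]) * (x[next_i] - x)
  sum(cross < 0)   # a negative turn is a reflex vertex on a counter-clockwise contour
}

# Toy contour: a square with one notch pushed into its top edge
x <- c(0, 4, 4, 2, 0)
y <- c(0, 0, 4, 2, 4)
count_reflex_vertices(x, y)   # 1, the vertex at (2, 2)
```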

# The concave.points features span a range far narrower than 2, so binwidth = 2
# would collapse each histogram into a single bar; use 30 equal-width bins.
ggplot(data, aes(x = concave.points_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = concave.points_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = concave.points_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Symmetry: In order to measure symmetry, the major axis, or longest chord through the center, is found. We then measure the length difference between lines perpendicular to the major axis to the cell boundary in both directions. See Figure 5. Special care is taken to account for cases where the major axis cuts the cell boundary because of a concavity.

# The symmetry features span a range far narrower than 2, so binwidth = 2
# would collapse each histogram into a single bar; use 30 equal-width bins.
ggplot(data, aes(x = symmetry_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = symmetry_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = symmetry_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Fractal Dimension: The fractal dimension of a cell is approximated using the “coastline approximation” described by Mandelbrot [9]. The perimeter of the nucleus is measured using increasingly larger ‘rulers’. As the ruler size increases, decreasing the precision of the measurement, the observed perimeter decreases. Plotting these values on a log scale and measuring the downward slope gives (the negative of) an approximation to the fractal dimension. As with all the shape features, a higher value corresponds to a less regular contour and thus to a higher probability of malignancy.
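
The slope step of the coastline approximation can be sketched as a straight-line fit on log-log axes. The ruler sizes and perimeters below are synthetic, and `coastline_slope` is a hypothetical name; how the perimeter is measured for each ruler is not shown. Under Richardson's relation, perimeter ∝ ruler^(1 − D), the returned value corresponds to D − 1.

```r
# ruler:     ruler lengths used to step around the contour
# perimeter: perimeter observed with each ruler
# Both are hypothetical inputs for illustration.
coastline_slope <- function(ruler, perimeter) {
  fit <- lm(log(perimeter) ~ log(ruler))   # straight-line fit on log-log axes
  -unname(coef(fit)[2])                    # negative of the (downward) slope
}

# Toy measurements following perimeter = 100 * ruler^(-0.25)
ruler <- c(1, 2, 4, 8, 16)
perimeter <- 100 * ruler^(-0.25)
coastline_slope(ruler, perimeter)   # approximately 0.25
```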

# The fractal_dimension features span a range far narrower than 2, so
# binwidth = 2 would collapse each histogram into a single bar; use 30 bins.
ggplot(data, aes(x = fractal_dimension_mean, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = fractal_dimension_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

ggplot(data, aes(x = fractal_dimension_worst, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))

Texture: The texture of the cell nucleus is measured by finding the variance of the gray scale intensities in the component pixels.
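
A minimal sketch of this definition, assuming a gray-scale image and a mask marking the pixels of one nucleus are available; neither is part of `data.csv`, and `texture_variance` is a hypothetical name. Note that `var()` computes the sample variance (divisor n − 1).

```r
# gray: numeric matrix of gray-scale intensities (hypothetical image)
# mask: logical matrix of the same size, TRUE for pixels inside the nucleus
texture_variance <- function(gray, mask) {
  var(gray[mask])   # sample variance of the intensities inside the nucleus
}

gray <- matrix(c(10, 12, 14, 16,
                 50, 50, 50, 50), nrow = 2, byrow = TRUE)
mask <- matrix(c(TRUE,  TRUE,  TRUE,  TRUE,
                 FALSE, FALSE, FALSE, FALSE), nrow = 2, byrow = TRUE)
texture_variance(gray, mask)   # var(c(10, 12, 14, 16)) = 20 / 3
```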

ggplot(data, aes(x = texture_mean, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))

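# texture_se spans only a few units, so binwidth = 2 would leave about three
# bars; 30 equal-width bins show the shape of the distribution instead.
ggplot(data, aes(x = texture_se, fill = diagnosis)) +
  geom_histogram(position = "dodge", bins = 30) + scale_fill_manual(values = c("#eaac8b", "#355070"))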

ggplot(data, aes(x = texture_worst, fill = diagnosis)) + 
  geom_histogram(position = "dodge", binwidth = 2) + scale_fill_manual(values = c("#eaac8b", "#355070"))