Abstract

This analysis explores morphometric features of cell nuclei and their relationship to diagnosis. The features we focus on are the radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points. In each section, we examine each feature separately in relation to diagnosis (benign vs malignant) and then compare the features with one another.

The data for this analysis were collected by Fine Needle Aspiration Cytology (FNAC). This procedure uses a narrow-gauge needle to collect a lesion sample for microscopic examination (Roskell and Buley 2004). “In symptomatic breast disease, FNAC used alongside clinical and radiological assessment allows rapid, inexpensive, and accurate diagnosis” (Roskell and Buley 2004). The data used in this analysis can be found on archive.ics.uci.edu.

Attribute Information:

  1. ID number
  2. Diagnosis (M = Malignant, B = Benign)

The Morphometric Features:

  1. radius (mean of distances from the center to points on the perimeter, in µm (micrometers))
  2. perimeter (µm)
  3. area (µm²) (Abdalla et al. 2008)
  4. smoothness (local variation in radius lengths)
  5. compactness (perimeter² / area - 1.0)
  6. concavity (severity of concave points along the nuclear border)
  7. concave points (number of concave points along the nuclear border)
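
The size and compactness definitions above can be illustrated with a minimal sketch. It assumes the nuclear outline is available as an ordered list of polygon vertices, which is an assumption made for illustration; the function name `contour_features` and the unit-square input are hypothetical and are not part of the data set or of the original feature-extraction software.

```r
# Toy contour: vertices of a closed polygon, listed in order (hypothetical data).
contour_features <- function(x, y) {
  n <- length(x)
  nxt <- c(2:n, 1)  # index of the next vertex, wrapping around to the first

  # Perimeter: sum of the edge lengths
  perimeter <- sum(sqrt((x[nxt] - x)^2 + (y[nxt] - y)^2))

  # Area: shoelace formula for a simple polygon
  area <- abs(sum(x * y[nxt] - x[nxt] * y)) / 2

  # Radius: mean distance from the center (here, the mean of the vertices)
  # to the points on the perimeter
  radius <- mean(sqrt((x - mean(x))^2 + (y - mean(y))^2))

  # Compactness as defined in the attribute list
  compactness <- perimeter^2 / area - 1.0

  c(radius = radius, perimeter = perimeter, area = area,
    compactness = compactness)
}

square <- contour_features(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))
print(square)
```

For the unit square, the perimeter is 4 and the area is 1, so the compactness formula gives 4² / 1 - 1.0 = 15. The values stored in the data set may be scaled differently, so this sketch shows only how the quantities relate to each other.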

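The data analysis is broken down into the following sections:

  1. Radius, Perimeter, and Area to Cell Nuclei Diagnosis
  2. Smoothness and Compactness to Cell Nuclei Diagnosis
  3. Concave Severity & Amount to Cell Nuclei Diagnosis
  4. Logistic Regression - Binomial for all Analyzed Variables in Relation to Diagnosis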

Radius, Perimeter, and Area to Cell Nuclei Diagnosis

One of the most common cancers in the female population is breast carcinoma. Normal cells transform, and cancer cells form. This transformation can be assessed by examining nuclear morphometry. Nuclear morphometric features such as size, shape, and pattern have been shown to predict the prognosis of breast cancer patients (Narasimha, Vasavi, and Kumar 2013).

We will evaluate size by radius, perimeter, and area. As shown in the study conducted by Narasimha, Vasavi, and Kumar (2013), “there was a gradual increase in the nuclear area and perimeter in carcinomas when compared to benign lesions.” Therefore, we expect a positive correlation between size and carcinomas (malignant cell nuclei).

We will begin our research with individual analysis of the radius, perimeter, and area and their relationship to the diagnosis.

The table below shows the average radius of benign and malignant cell nuclei:

kable1 <- mean_bc_data %>%
  group_by(diagnosis) %>%
  summarise(mean(radius_mean)) # average radius Benign = 12.1 and Malignant = 17.5

knitr::kable(kable1)
diagnosis mean(radius_mean)
B 12.14652
M 17.46283
t.test(mean_bc_data$radius_mean ~ mean_bc_data$diagnosis, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$radius_mean by mean_bc_data$diagnosis
## t = -22.209, df = 289.71, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.787448 -4.845165
## sample estimates:
## mean in group B mean in group M 
##        12.14652        17.46283

As seen in the averages, the average radius of benign cell nuclei is 12.1 µm and that of malignant cell nuclei is 17.5 µm.

The null hypothesis of the t-test is that the two group means are equal. Because the p-value is close to zero, we can reject that null hypothesis and conclude that there is a statistically significant difference between the two averages.
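
The Welch statistic, degrees of freedom, and p-value reported by `t.test` can be reproduced by hand. The sketch below is illustrative only: the helper `welch_t` is a hypothetical name, and the two vectors are made-up radii, not values from the data set.

```r
# Welch two-sample t-test computed from first principles
welch_t <- function(x, y) {
  se2_x <- var(x) / length(x)  # squared standard error of mean(x)
  se2_y <- var(y) / length(y)  # squared standard error of mean(y)

  t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (se2_x + se2_y)^2 /
    (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

  # Two-sided p-value
  p_value <- 2 * pt(-abs(t_stat), df)

  list(t = t_stat, df = df, p.value = p_value)
}

# Made-up radii for illustration only (not taken from the data set)
benign_toy    <- c(11.2, 12.5, 12.0, 13.1, 11.8, 12.4)
malignant_toy <- c(16.9, 18.2, 17.1, 19.0, 15.8)

manual  <- welch_t(benign_toy, malignant_toy)
builtin <- t.test(benign_toy, malignant_toy, var.equal = FALSE)
```

The fractional degrees of freedom in the output above (for example, df = 289.71) come from this approximation, which does not assume equal variances in the two groups.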

Next, we will look at the average perimeter of benign and malignant cell nuclei:

kable2 <- mean_bc_data %>%
  group_by(diagnosis) %>%
  summarise(mean(perimeter_mean)) # average perimeter Benign = 78.1 and Malignant = 115

knitr::kable(kable2)
diagnosis mean(perimeter_mean)
B 78.07541
M 115.36538
t.test(mean_bc_data$perimeter_mean ~ mean_bc_data$diagnosis, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$perimeter_mean by mean_bc_data$diagnosis
## t = -22.935, df = 285.41, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -40.49020 -34.08974
## sample estimates:
## mean in group B mean in group M 
##        78.07541       115.36538

Lastly, we will look at the average area of benign cell nuclei and malignant cell nuclei:

kable3 <- mean_bc_data %>%
  group_by(diagnosis) %>%
  summarise(mean(area_mean)) # average area Benign = 463 and Malignant = 978

knitr::kable(kable3)
diagnosis mean(area_mean)
B 462.7902
M 978.3764
t.test(mean_bc_data$area_mean ~ mean_bc_data$diagnosis, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$area_mean by mean_bc_data$diagnosis
## t = -19.641, df = 244.79, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -567.2919 -463.8805
## sample estimates:
## mean in group B mean in group M 
##        462.7902        978.3764

Let’s visualize all three features as boxplots side by side and note the relationship seen:

radius_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = radius_mean)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  labs(x = "Diagnosis (Benign vs Malignant)", 
       y = "Average Radius",
       title = "Average Radius of Malignant Cell Nuclei is Larger",
       subtitle = "Breast Cancer UCI Data",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

peri_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = perimeter_mean)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  labs(x = "Diagnosis (Benign vs Malignant)", 
       y = "Average Perimeter",
       title = "Average Perimeter of Malignant Cell Nuclei is Larger",
       subtitle = "Breast Cancer UCI Data",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

area_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = area_mean)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  labs(x = "Diagnosis (Benign vs Malignant)", 
       y = "Average Area",
       title = "Average Area of Malignant Cell Nuclei is Larger",
       subtitle = "Breast Cancer UCI Data",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(radius_bp, peri_bp, area_bp, cols = 2) #the above 3 plots plotted on a single graph for visibility purposes

Let’s now look at how each size measurement is distributed within each diagnosis:

radius_dp <- ggplot(mean_bc_data, aes(x = radius_mean, fill = diagnosis)) +
  geom_density(size = 1, alpha = .5) +
  labs(x = "Average Radius", 
       y = "Density",
       title = "Majority of Malignant Cells have an",
       subtitle = "Average Radius of between 15 and 20",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

peri_dp <- ggplot(mean_bc_data, aes(x = perimeter_mean, fill = diagnosis)) +
  geom_density(size = 1, alpha = .5) +
  labs(x = "Average Perimeter", 
       y = "Density",
       title = "Majority of Malignant Cells have an",
       subtitle = "Average Perimeter of between 100 and 130",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))


area_dp <- ggplot(mean_bc_data, aes(x = area_mean, fill = diagnosis)) +
  geom_density(size = 1, alpha = .5) +
  labs(x = "Average Area", 
       y = "Density",
       title = "Majority of Malignant Cells have an",
       subtitle = "Average Area of between 700 and 1250",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(radius_dp, peri_dp, area_dp, cols = 2) #the above 3 plots plotted on a single graph for visibility purposes

Lastly, let’s look at how the three measures of size (radius, perimeter, and area) relate to one another.

We will do this with the use of scatter plots as shown below:

rp_sp <- ggplot(mean_bc_data, aes(x = radius_mean, y = perimeter_mean, color = diagnosis)) +
  geom_point()+
  labs(x = "Average Radius", 
       y = "Average Perimeter",
       title = "Positive Linear Correlation",
       subtitle = "Between the Radius and Perimeter",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ra_sp <- ggplot(mean_bc_data, aes(x = radius_mean, y = area_mean, color = diagnosis)) +
  geom_point() +
  labs(x = "Average Radius", 
       y = "Average Area",
       title = "Positive Correlation",
       subtitle = "Between the Radius and Area",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

pa_sp <- ggplot(mean_bc_data, aes(x = perimeter_mean, y = area_mean, color = diagnosis)) +
  geom_point() +
  labs(x = "Average Perimeter", 
       y = "Average Area",
       title = "Positive Correlation",
       subtitle = "Between the Perimeter and Area",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(rp_sp, ra_sp, pa_sp, cols = 2)
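
One way to read the scatter plots above is through an idealized case. The sketch below assumes every nucleus is a perfect circle, which real nuclei are not, and uses hypothetical radii. Under that assumption the perimeter grows linearly with the radius while the area grows with its square, which would be consistent with a straight-line pattern for radius vs perimeter and a curved pattern for the two area plots.

```r
# Idealized assumption: every nucleus is a perfect circle (real nuclei are not).
radius    <- seq(5, 25, by = 1)  # hypothetical radii
perimeter <- 2 * pi * radius     # linear in the radius
area      <- pi * radius^2       # quadratic in the radius

cor_rp <- cor(radius, perimeter) # perfectly linear relationship
cor_ra <- cor(radius, area)      # strong, but curved rather than linear

# A straight-line fit of area on radius^2 recovers pi as the slope
slope <- unname(coef(lm(area ~ I(radius^2)))[2])
```

In the real data the points scatter around these idealized curves because nuclei are irregular, so this is a reading aid rather than a model of the measurements.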

Smoothness and Compactness to Cell Nuclei Diagnosis

Smoothness is calculated from the local variation in radius lengths within cell nuclei. The closer the value is to zero, the less the radius varies and the smoother the cell nuclei are. The further the value is from zero, the more the radius varies and the less smooth the cell nuclei are.
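
To make "local variation in radius lengths" concrete, here is one plausible way to compute such a measure. This is an assumption for illustration, not necessarily the formula used to build the data set: each radial length is compared with the mean of its two neighbors along a closed outline. The function name `local_radius_variation` and the input values are hypothetical.

```r
# radii: lengths of radial lines from the center to consecutive boundary points
# (hypothetical values; the outline is closed, so neighbors wrap around)
local_radius_variation <- function(radii) {
  n <- length(radii)
  prev_r <- radii[c(n, 1:(n - 1))]  # previous neighbor of each point
  next_r <- radii[c(2:n, 1)]        # next neighbor of each point

  # Mean absolute difference between each radius and its neighbors' mean
  mean(abs(radii - (prev_r + next_r) / 2))
}

round_outline  <- local_radius_variation(rep(10, 8))         # 0
jagged_outline <- local_radius_variation(c(10, 12, 10, 12))  # 2
```

A perfectly round outline scores zero, and the score rises as neighboring radii differ more, which matches the interpretation given above.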

We will visualize the average smoothness between benign and malignant cells:

ggplot(mean_bc_data, aes(x = diagnosis, y = smoothness_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Smoothness",
       title = "Malignant Cell Nuclei are Less Smooth",
       subtitle = "Closer to Zero, Less Variation in Radius Lengths",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

t.test(mean_bc_data$smoothness_mean ~ mean_bc_data$diagnosis, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$smoothness_mean by mean_bc_data$diagnosis
## t = -9.2974, df = 466.21, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01262337 -0.00821832
## sample estimates:
## mean in group B mean in group M 
##      0.09247765      0.10289849

Compactness is computed as perimeter² / area - 1.0 (see the attribute list above), so it describes how tightly the nuclear outline encloses its area.

In this case, the smaller the number, the more compact the nucleus is, and the larger the number, the less compact it is:

ggplot(mean_bc_data, aes(x = diagnosis, y = compactness_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Compactness",
       title = "Malignant Cell Nuclei are Less Compact",
       subtitle = "Closer to Zero, the More Compact",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Malignant cell nuclei are less compact on average than benign cell nuclei. Smoothness and compactness appear to separate the diagnoses in the same direction: higher values are associated with malignant cell nuclei and lower values with benign cell nuclei.

Now, let’s look at the relationship between the smoothness and compactness of cell nuclei:

ggplot(mean_bc_data, aes(x = smoothness_mean, y = compactness_mean, color = diagnosis)) +
  geom_point() +
  geom_smooth(se = FALSE, size = 2) +
  labs(x = "Average Smoothness", 
       y = "Average Compactness",
       title = "Positive Correlation Between Smoothness and Compactness",
       subtitle = "Seen in Both Benign and Malignant Cell Nuclei",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Concave Severity & Amount to Cell Nuclei Diagnosis

In this section of the analysis, concavity represents the severity of the concave points of the contour, and concave points represents the number of concave points of the contour. The further a value is from zero, the more severe the concave points are or the higher the number of concave points is.

Let’s take a look at the summary data for concavity:

kable4 <- mean_bc_data %>%
  summarise(
    mean(concavity_mean),
    min(concavity_mean),
    max(concavity_mean)
  )

knitr::kable(kable4)
mean(concavity_mean) min(concavity_mean) max(concavity_mean)
0.0887993 0 0.4268

Let’s visualize the relationship between concavity (severity of concave points) and diagnosis:

ggplot(mean_bc_data, aes(x = diagnosis, y = concavity_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Concavity",
       title = "Malignant Cell Nuclei are More Severe in Concavity",
       subtitle = "Closer to Zero, Less Severity in Concave Points",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Now let’s explore the number of concave points in the same way as severity:

kable5 <- mean_bc_data %>%
  summarise(
    mean(concave.points_mean),
    min(concave.points_mean),
    max(concave.points_mean)
  )

knitr::kable(kable5)
mean(concave.points_mean) min(concave.points_mean) max(concave.points_mean)
0.0489191 0 0.2012
ggplot(mean_bc_data, aes(x = diagnosis, y = concave.points_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Number of Concave Points",
       title = "Malignant Cell Nuclei Have More Concave Points",
       subtitle = "Closer to Zero, Fewer Concave Points",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Therefore, if we plot the two variables above against each other, we expect to see a positive relationship as well:

ggplot(mean_bc_data, aes(x = concave.points_mean, y = concavity_mean, color = diagnosis)) +
  geom_point(position = "jitter", alpha = 1/5) +
  geom_smooth(se = FALSE, size = 2) +
  geom_vline(xintercept = .0853, linetype = "dashed", color = "blue", size = 1) +
  labs(x = "Average Number of Concave Points", 
       y = "Average Concavity",
       title = "Benign Cell Nuclei Have a Sudden Breakpoint",
       subtitle = "Number of Concave Points can be a Good Predictor of Malignancy",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

There seems to be a breakpoint in benign cell nuclei (in terms of concave points), as shown by the blue dashed line.

kable6 <- mean_bc_data %>%
  filter(diagnosis == "B") %>%
  summarise(
    mean(concave.points_mean),
    max(concave.points_mean) # max for benign nuclei is .0853; the vertical line in the ggplot above is drawn at this value to mark the cutoff
  )

knitr::kable(kable6)
mean(concave.points_mean) max(concave.points_mean)
0.0257174 0.08534

With that being said, can we use the number of concave points as a strong indicator or predictor of malignancy in a cell nucleus? For example, is an average greater than .0853 a good indicator of malignancy? Can this be applied to the population?
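
A cutoff rule like this can be checked by counting how often it is right in each group. The sketch below is a minimal illustration on made-up values; the function `evaluate_cutoff`, the data frame `toy`, and its numbers are hypothetical and are not results from the data set.

```r
# Classify as malignant when the value exceeds the cutoff, then score the rule
evaluate_cutoff <- function(values, diagnosis, cutoff) {
  predicted_malignant <- values > cutoff
  is_malignant <- diagnosis == "M"
  c(
    # share of malignant nuclei that the rule flags
    sensitivity = sum(predicted_malignant & is_malignant) / sum(is_malignant),
    # share of benign nuclei that the rule leaves unflagged
    specificity = sum(!predicted_malignant & !is_malignant) / sum(!is_malignant)
  )
}

# Made-up values for illustration only
toy <- data.frame(
  concave_points = c(0.01, 0.03, 0.05, 0.08, 0.06, 0.09, 0.12, 0.15),
  diagnosis      = c("B",  "B",  "B",  "B",  "M",  "M",  "M",  "M")
)

# Cutoff chosen the same way as in the text: the largest benign value
cutoff <- max(toy$concave_points[toy$diagnosis == "B"])
result <- evaluate_cutoff(toy$concave_points, toy$diagnosis, cutoff)
```

Because the cutoff is set to the largest benign value in the sample, the specificity on that same sample is 100% by construction. Whether the rule generalizes to the population would have to be checked on data that were not used to choose the cutoff.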

Logistic Regression - Binomial for all Analyzed Variables in Relation to Diagnosis

Below is a binomial logistic regression model for predicting the diagnosis of the cell nuclei.

The model uses the variables examined throughout this analysis (radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points) to determine which are good predictors and which are not. Good predictors are indicated by a small p-value and a large effect size (Estimate). Note that each estimate is expressed in log-odds per unit of its own variable, so estimates for variables measured on different scales are not directly comparable.

predicted <- glm(diagnosis ~ radius_mean + perimeter_mean + area_mean + smoothness_mean + compactness_mean + concavity_mean + concave.points_mean, family = "binomial", data = mean_bc_data)
summary(predicted)
## 
## Call:
## glm(formula = diagnosis ~ radius_mean + perimeter_mean + area_mean + 
##     smoothness_mean + compactness_mean + concavity_mean + concave.points_mean, 
##     family = "binomial", data = mean_bc_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.33160  -0.24636  -0.12261   0.01534   2.68460  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          2.77059    8.27776   0.335  0.73785   
## radius_mean         -3.23813    2.87701  -1.126  0.26037   
## perimeter_mean       0.18040    0.39185   0.460  0.64524   
## area_mean            0.03258    0.01308   2.491  0.01275 * 
## smoothness_mean     27.84861   24.55626   1.134  0.25676   
## compactness_mean    -8.43463   12.33468  -0.684  0.49409   
## concavity_mean       5.22390    7.02471   0.744  0.45709   
## concave.points_mean 66.82251   23.39177   2.857  0.00428 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.44  on 568  degrees of freedom
## Residual deviance: 198.94  on 561  degrees of freedom
## AIC: 214.94
## 
## Number of Fisher Scoring iterations: 8

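As seen above, two variables are statistically significant (p-value below 5%):

  1. area_mean (p = 0.01275)
  2. concave.points_mean (p = 0.00428)

Now we will analyze the effect size (Estimate):

  1. area_mean: 0.03258
  2. concave.points_mean: 66.82251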

We can conclude from our regression analysis that the average area is a good predictor but the average number of concave points is the best predictor of diagnosis (Benign vs Malignant).
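
The two significant estimates can be made easier to interpret by converting them to odds ratios. The sketch below copies the estimates from the regression summary above; the 0.01 step for concave points is an assumption chosen for illustration, because that variable only ranges from 0 to about 0.2 in this data.

```r
# Estimates copied from the regression summary above
beta_area           <- 0.03258   # log-odds per 1 unit of area_mean
beta_concave_points <- 66.82251  # log-odds per 1 unit of concave.points_mean

# Odds ratio for a 1-unit increase in area_mean
or_area <- exp(beta_area)

# Odds ratio for a 0.01 increase in concave.points_mean
# (a full 1-unit increase lies far outside the observed range)
or_concave_points <- exp(beta_concave_points * 0.01)

print(round(c(area = or_area, concave_points = or_concave_points), 3))
```

Holding the other variables fixed, the odds of malignancy are multiplied by roughly 1.03 for each additional unit of area and by roughly 1.95 for each additional 0.01 of concave points.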

Lastly, we will graph our Regression Model above and ensure it captures the expected binomial relationship between diagnosis and the variables:

probability_data <- data.frame(fitted.values = predicted$fitted.values, status = mean_bc_data$diagnosis)

probability_data <- probability_data %>%
  arrange(fitted.values)

probability_data <- probability_data %>%
  mutate(rank = 1:nrow(probability_data))

ggplot(probability_data, aes(x = rank, y = fitted.values, color = status)) +
  geom_point(alpha = 1, shape = 1, stroke = 2) +
  labs(x = "Rank", 
       y = "Predicted Probability of Malignancy",
       title = "Predicted Probability of Malignant Cell Nuclei",
       subtitle = "Closer to One, the Higher the Probability of Malignancy",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
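
Fitted probabilities like those plotted above can also be summarized as a confusion table. The sketch below is illustrative only: it fits a one-variable logistic regression to simulated data (the data frame `toy`, the variable `x`, and the 0.5 cutoff are all assumptions), not to the breast cancer data.

```r
set.seed(1)

# Simulated data: one predictor whose mean differs between the two groups
toy <- data.frame(
  x      = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),
  status = factor(rep(c("B", "M"), each = 50))
)

# For a factor response, glm models the probability of the second level ("M")
fit <- glm(status ~ x, family = "binomial", data = toy)

# Classify as malignant when the fitted probability exceeds 0.5 (assumed cutoff)
predicted_class <- ifelse(fit$fitted.values > 0.5, "M", "B")

confusion <- table(actual = toy$status, predicted = predicted_class)
accuracy  <- sum(diag(confusion)) / sum(confusion)
```

The same steps could be applied to `predicted$fitted.values` and `mean_bc_data$diagnosis` to see how many nuclei fall on the wrong side of the chosen cutoff.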

Conclusion

One of the most common cancers in the female population is breast carcinoma. Understanding the correlation between nuclear morphometry and diagnosis can lead to an early and accurate diagnosis, which in turn can lead to early treatment. As shown in this analysis, the radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points all showed a positive correlation with malignancy. As each value grew further from zero, the likelihood of malignancy increased.

Although all morphometric features were associated with diagnosis, only a select few emerged as good predictors of malignancy in our binomial logistic regression model. Overall, the average area and the number of concave points of the cell nuclei were the two good predictors of malignancy in our analysis. The average area had a p-value of 0.013 and an effect size of 0.033, and the average number of concave points had a p-value of 0.0043 and an effect size of 66.82. Due to its large effect size and small p-value, the average number of concave points is the best predictor within our analysis.

Opportunities for further analysis could be researching the following:

Citations

Abdalla, Fathi, Jamela Boder, Abdelbaset Buhmeida, Hussein Hashmi, Adem Elzagheid, and Yrjö Collan. 2008. “Nuclear Morphometry in FNABs of Breast Disease in Libyans.” Anticancer Research 28 (6B): 3985–9.

Narasimha, Aparna, B Vasavi, and Harendra ML Kumar. 2013. “Significance of Nuclear Morphometry in Benign and Malignant Breast Aspirates.” International Journal of Applied and Basic Medical Research 3 (1): 22.

Roskell, Derek E, and Ian D Buley. 2004. “Fine Needle Aspiration Cytology in Cancer Diagnosis.” British Medical Journal Publishing Group.