Breast Cancer UCI Analysis

Abstract

This analysis deals with exploring different morphometric features of cell nuclei and their relationship to diagnosis. The key morphometric features that we are going to focus on are the radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points. Within each section of the analysis, we will explore each feature separately in relation to diagnosis (benign vs malignant) and then compare each feature to each other.

The data for this analysis were collected from a Fine Needle Aspiration Cytology (FNAC). This process includes using a narrow-gauge needle to collect a lesion sample for microscopic examination. (Roskell and Buley 2004) “In symptomatic breast disease, FNAC used alongside clinical and radiological assessment allows rapid, inexpensive, and accurate diagnosis.” (Roskell and Buley 2004) The data represented in this analysis can be found on archive.ics.uci.edu.

Attribute Information:

ID number
Diagnosis (M = Malignant, B = Benign)

The Morphometric Features:

radius (mean of distances from the center to points on the perimeter in um (micrometers))
perimeter (um)
area (um²) (Abdalla et al. 2008)
smoothness (local variation in radius lengths)
compactness (perimeter² / area - 1.0)
concavity (severity of concave points along the nuclear border)
concave points (number of concave points along the nuclear border)

The data analysis will be broken down by the following sections:

Radius, Perimeter, and Area to Cell Nuclei Diagnosis
Smoothness and Compactness to Cell Nuclei Diagnosis
Concave Severity & Amount to Cell Nuclei Diagnosis
Logistic Regression - Binomial for all Analyzed Variables in Relation to Diagnosis

Radius, Perimeter, and Area to Cell Nuclei Diagnosis

One of the most common cancers within the female population is breast carcinoma. Normal cells will transform and the formation of cancer cells will occur. The transformation of these cells can be assessed by looking at their nuclear morphometry. The nuclear morphometric features of size, shape, pattern, etc, have shown to predict the prognosis of breast cancer patients. (Narasimha, Vasavi, and Kumar 2013)

We will evaluate size by their radius, perimeter, and area. As shown in the study conducted by (Narasimha, Vasavi, and Kumar 2013) , “there was a gradual increase in the nuclear area and perimeter in carcinomas when compared to benign lesions.” Therefore, we will assume a positive correlation between size and carcinomas (malignant cell nuclei).

We will begin our research with individual analysis of the radius, perimeter, and area and their relationship to the diagnosis.

The boxplot below shows the average radius of benign and malignant cell nuclei:

kable1 <- data %>%
  group_by(diagnosis) %>%
  summarise(mean(radius_mean)) # average radius Benign = 12.1 and Malignant = 17.5

knitr::kable(kable1)

diagnosis	mean(radius_mean)
B	12.14652
M	17.46283

t.test(mean_bc_data$radius_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$radius_mean by mean_bc_data$diagnosis
## t = -22.209, df = 289.71, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.787448 -4.845165
## sample estimates:
## mean in group B mean in group M 
##        12.14652        17.46283

As seen in the averages, the average radius of benign cell nuclei is 12.1 um and of malignant cell nuclei is 17.5 um. The averages are statistically different from each other as shown in the t-test.

Our null hypothesis: there is no difference between the averages of benign and malignant cell nuclei.

Because our p-value is close to zero, we can reject our null hypothesis above and conclude that there is a statistically significant difference between the two averages.

Since zero is also not within the confidence interval either, it supports our rejection of the null with 95% confidence.

Next, we will look at the average perimeter of benign and malignant cell nuclei:

kable2<- mean_bc_data %>%
  group_by(diagnosis) %>%
  summarise(mean(perimeter_mean)) # average perimeter Benign = 78.1 and Malignant = 115

knitr::kable(kable2)

diagnosis	mean(perimeter_mean)
B	78.07541
M	115.36538

t.test(mean_bc_data$perimeter_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$perimeter_mean by mean_bc_data$diagnosis
## t = -22.935, df = 285.41, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -40.49020 -34.08974
## sample estimates:
## mean in group B mean in group M 
##        78.07541       115.36538

On average, benign cell nuclei had an average perimeter of 78.08 um and malignant cell nuclei had an average of 115.4 um.
As shown in our t-test, there is a statistically significant difference.

Lastly, we will look at the average area of benign cell nuclei and malignant cell nuclei:

kable3 <- mean_bc_data %>%
  group_by(diagnosis) %>%
  summarise(mean(area_mean)) # average area Benign = 463 and Malignant = 978

knitr::kable(kable3)

diagnosis	mean(area_mean)
B	462.7902
M	978.3764

t.test(mean_bc_data$area_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$area_mean by mean_bc_data$diagnosis
## t = -19.641, df = 244.79, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -567.2919 -463.8805
## sample estimates:
## mean in group B mean in group M 
##        462.7902        978.3764

As seen again, the malignant cell nuclei have larger averages than benign cell nuclei which are statistically significant.
The benign cell nuclei had an average area of 463 um² and the malignant cell nuclei had an average area of 978 um².

Let’s visualize all 3 plots side-by-side and record the relationship seen:

radius_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = radius_mean)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  labs(x = "Diagnosis (Benign vs Malignant)", 
       y = "Average Radius",
       title = "Average Radius of Malignant Cell Nuclei are Larger",
       subtitle = "Breast Cancer UCI Data",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

peri_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = perimeter_mean)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  labs(x = "Diagnosis (Benign vs Malignant)", 
       y = "Average Perimeter",
       title = "Average Perimeter of Malignant Cell Nuclei are Larger",
       subtitle = "Breast Cancer UCI Data",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

area_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = area_mean)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  labs(x = "Diagnosis (Benign vs Malignant)", 
       y = "Average Area",
       title = "Average Area of Malignant Cell Nuclei are Larger",
       subtitle = "Breast Cancer UCI Data",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(radius_bp, peri_bp, area_bp, cols = 2) #the above 3 plots plotted on a single graph for visibility purposes

The relationship seen within this side-by-side comparison is that each variable of size (radius, perimeter, and area) shares a positive correlation with malignancy. As the size of each variable increases, there is a higher likelihood of being categorized as malignant.
This shared relationship is not surprising since a greater radius would signify a greater perimeter, which would signify a greater area.

Let’s now look at how the patients are distributed among the different diagnoses:

radius_dp <- ggplot(mean_bc_data, aes(x = radius_mean, fill = diagnosis)) +
  geom_density(size = 1, alpha = .5) +
  labs(x = "Average Radius", 
       y = "Density",
       title = "Majority of Malignant Cells have an",
       subtitle = "Average Radius of between 15 and 20",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

peri_dp <- ggplot(mean_bc_data, aes(x = perimeter_mean, fill = diagnosis)) +
  geom_density(size = 1, alpha = .5) +
  labs(x = "Average Perimeter", 
       y = "Density",
       title = "Majority of Malignant Cells have an",
       subtitle = "Average Perimeter of between 100 and 130",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))


area_dp <- ggplot(mean_bc_data, aes(x = area_mean, fill = diagnosis)) +
  geom_density(size = 1, alpha = .5) +
  labs(x = "Average Area", 
       y = "Density",
       title = "Majority of Malignant Cells have an",
       subtitle = "Average Area of between 700 and 1250",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(radius_dp, peri_dp, area_dp, cols = 2) #the above 3 plots plotted on a single graph for visibility purposes

The density plot is a great visual aid to see how the patients are distributed depending on their diagnosis. The peaks signify that at that particular x value, a majority of the patients will fall into that bin.
- Benign cells have one peak while malignant cells have two peaks

Lastly, let’s look at the relationship of each variable of size (radius, perimeter, and area) to each other.

We will do this with the use of scatter plots as shown below:

rp_sp <- ggplot(mean_bc_data, aes(x = radius_mean, y = perimeter_mean, color = diagnosis)) +
  geom_point()+
  labs(x = "Average Radius", 
       y = "Average Perimeter",
       title = "Positive Linear Correlation",
       subtitle = "Between the Radius and Perimeter",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ra_sp <- ggplot(mean_bc_data, aes(x = radius_mean, y = area_mean, color = diagnosis)) +
  geom_point() +
  labs(x = "Average Radius", 
       y = "Average Area",
       title = "Positive Correlation",
       subtitle = "Between the Radius and Area",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

pa_sp <- ggplot(mean_bc_data, aes(x = perimeter_mean, y = area_mean, color = diagnosis)) +
  geom_point() +
  labs(x = "Average Perimeter", 
       y = "Average Area",
       title = "Positive Correlation",
       subtitle = "Between the Perimeter and Area",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(rp_sp, ra_sp, pa_sp, cols = 2)

An increase in one variable of size (radius, perimeter, or area) will result in an increase of another variable as they all share a positive correlation with one another. We also see a natural shift of malignancy as the size increases in all three measurements of radius, perimeter, and area.
Therefore, we can conclude that large cell nuclei could increase the likelihood of malignancy.

Smoothness and Compactness to Cell Nuclei Diagnosis

Smoothness is calculated by the local variation in radius lengths within cell nuclei. The closer it is to zero, the less variation and the more smooth the cell nuclei are. The further away it is from zero, the more variability and less smooth the cell nuclei are.

We will visualize the average smoothness between benign and malignant cells:

ggplot(mean_bc_data, aes(x = diagnosis, y = smoothness_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Smoothness",
       title = "Malignant Cell Nuclei are Less Smooth",
       subtitle = "Closer to Zero, Less Variation in Radius Lengths",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

t.test(mean_bc_data$smoothness_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  mean_bc_data$smoothness_mean by mean_bc_data$diagnosis
## t = -9.2974, df = 466.21, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01262337 -0.00821832
## sample estimates:
## mean in group B mean in group M 
##      0.09247765      0.10289849

We see that on average, malignant cells have more variation and are less smooth. Also, according to our t-test, there is a statistically significant difference between the average smoothness of benign and malignant cell nuclei.

Compactness is defined by the cell’s ability to be packed together closely.

In this case, the smaller the number, the more compact it is and the larger the number the less compact it is:

ggplot(mean_bc_data, aes(x = diagnosis, y = compactness_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Compactness",
       title = "Malignant Cell Nuclei are Less Compact",
       subtitle = "Closer to Zero, the More Compact",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Malignant cell nuclei are less compact on average than benign cell nuclei. Both the smoothness and compactness seem to categorize diagnosis by the following:

The closer you are to zero the higher the likelihood you have benign cell nuclei and the further you are from zero the higher the likelihood you have malignant cell nuclei.

Now, let’s look at the relationship between smoothness and compactness of a cell nuclei:

ggplot(mean_bc_data, aes(x = smoothness_mean, y = compactness_mean, color = diagnosis)) +
  geom_point() +
  geom_smooth(se = FALSE, size = 2) +
  labs(x = "Average Smoothness", 
       y = "Average Compactness",
       title = "Positive Correlation Between Smoothness and Compactness",
       subtitle = "Seen in Both Benign and Malignant Cell Nuclei",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

This graph effectively tells us that both of the diagnoses are positively correlated with smoothness and compactness.

Concave Severity & Amount to Cell Nuclei Diagnosis

Within this section of the analysis, concavity represents the severity of concave points of the contour and the concave points represent the number of concave points of the contour. The further away it is from zero, the more severe the concave points are and the higher the total number of concave points are.

Let’s take a look at the summary data for concavity:

kable4 <- mean_bc_data %>%
  summarise(
    mean(concavity_mean),
    min(concavity_mean),
    max(concavity_mean)
  )

knitr::kable(kable4)

mean(concavity_mean)	min(concavity_mean)	max(concavity_mean)
0.0887993	0	0.4268

The range of the data within concavity is between 0 and 0.427. Zero indicating no severity in concave points (relatively flat).

Let’s visualize the relationship between concavity (severity of concave points) to diagnosis:

ggplot(mean_bc_data, aes(x = diagnosis, y = concavity_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Concavity",
       title = "Malignant Cell Nuclei are More Severe in Concavity",
       subtitle = "Closer to Zero, Less Severity in Concave Points",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

As seen above, the higher the average concavity, the higher the likelihood of malignancy.

Now let’s explore the number of concave points in the same regard as severity:

kable5 <- mean_bc_data %>%
  summarise(
    mean(concave.points_mean),
    min(concave.points_mean),
    max(concave.points_mean)
  )

knitr::kable(kable5)

mean(concave.points_mean)	min(concave.points_mean)	max(concave.points_mean)
0.0489191	0	0.2012

The range of the data within average concave points is between 0 and 0.201. Zero indicating no concave points around the nuclear border.

ggplot(mean_bc_data, aes(x = diagnosis, y = concave.points_mean)) +
  geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
  labs(x = "Diagnosis", 
       y = "Average Number of Concave Points",
       title = "Malignant Cell Nuclei Have More Concave Points",
       subtitle = "Closer to Zero, Less Concave Points",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Overall, the higher the number of concave points around the nuclear border, the higher the likelihood of malignancy.
Both the severity and amount of concave points indicate a positive relationship with the diagnosis.

Therefore, if we plot the above variables, we assume to see a positive relationship as well:

ggplot(mean_bc_data, aes(x = concave.points_mean, y = concavity_mean, color = diagnosis)) +
  geom_point(position = "jitter", alha = 1/5) +
  geom_smooth(se = FALSE, size = 2) +
  geom_vline(xintercept = .0853, linetype = "dashed", color = "blue", size = 1) +
  labs(x = "Average Number of Concave Points", 
       y = "Average Concavity",
       title = "Benign Cell Nuclei has a Sudden Breakpoint",
       subtitle = "Number of Concave Points can be a Good Predictor of Malignancy",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

This relationship makes sense because the higher the number of concave points, the higher the severity.
- If there are no concave points, there should be virtually no severity as the concavity is not there.

There seems to be a breakpoint in benign cancer cells (in terms of concave points) as shown by the blue dashed line

100% of cancer cells with an average number of concave points > 0.0853 are Malignant as shown by the maximum below

kable6 <- mean_bc_data %>%
  filter(diagnosis == "B") %>%
  summarise(
    mean(concave.points_mean),
    max(concave.points_mean) # max number of concave points is .0853. Let's add a vertical line to ggplot above at .09 to show cut off of concave points.
  )

knitr::kable(kable6)

mean(concave.points_mean)	max(concave.points_mean)
0.0257174	0.08534

With that being said, can we use the number of concave points as a strong indicator/predictor for malignancy within a cell nuclei? Such as that an average greater than .0853 is a good indicator of malignancy? Can this be applied to the population?

We can explore that with a Logistic Linear Regression model and see if this variable is a good predictor of malignancy compared to the other known variables we’ve already explored.

Logistic Regression - Binomial for all Analyzed Variables in Relation to Diagnosis

Below is a Binomial Logistic Linear Regression model for predicting the diagnosis of the cell nuclei.

The model will utilize the variables used throughout this analysis (radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points) and determine which are good predictors and which are not. Good predictors are indicated by a small p-value and large effect size (Estimate).

predicted <- glm(diagnosis ~ radius_mean + perimeter_mean + area_mean + smoothness_mean + compactness_mean + concavity_mean + concave.points_mean, family = "binomial", data = mean_bc_data)
summary(predicted)

## 
## Call:
## glm(formula = diagnosis ~ radius_mean + perimeter_mean + area_mean + 
##     smoothness_mean + compactness_mean + concavity_mean + concave.points_mean, 
##     family = "binomial", data = mean_bc_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.33160  -0.24636  -0.12261   0.01534   2.68460  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          2.77059    8.27776   0.335  0.73785   
## radius_mean         -3.23813    2.87701  -1.126  0.26037   
## perimeter_mean       0.18040    0.39185   0.460  0.64524   
## area_mean            0.03258    0.01308   2.491  0.01275 * 
## smoothness_mean     27.84861   24.55626   1.134  0.25676   
## compactness_mean    -8.43463   12.33468  -0.684  0.49409   
## concavity_mean       5.22390    7.02471   0.744  0.45709   
## concave.points_mean 66.82251   23.39177   2.857  0.00428 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.44  on 568  degrees of freedom
## Residual deviance: 198.94  on 561  degrees of freedom
## AIC: 214.94
## 
## Number of Fisher Scoring iterations: 8

As seen above there are two variables that are statistically significant (less than 5%):

Average Area
Average Number of Concave Points

Now we will analyze the effect size (Estimate):

Although the average area has a statistically significant p-value, the effect size is relatively small at 0.03258.
However, the average number of concave points has a statistically significant p-value and has a relatively large effect size at 66.82251.

We can conclude from our regression analysis that the average area is a good predictor but the average number of concave points is the best predictor of diagnosis (Benign vs Malignant).

Lastly, we will graph our Regression Model above and ensure it captures the expected binomial relationship between diagnosis and the variables:

probability_data <- data.frame(fitted.values = predicted$fitted.values, status = mean_bc_data$diagnosis)

probability_data <- probability_data %>%
  arrange(fitted.values)

probability_data <- probability_data %>%
  mutate(rank = 1:nrow(probability_data))

ggplot(probability_data, aes(x = rank, y = fitted.values, color = status)) +
  geom_point(alpha = 1, shape = 1, stroke = 2) +
  labs(x = "Rank", 
       y = "Predicted Probability of Malignancy",
       title = "Predicted Probability of Malignant Cell Nuclei",
       subtitle = "Closer to One, the Higher the Probability of Malignancy",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

As seen by the graph, our linear model has accurately captured the relationship of the variables in relation to diagnosis.

Conclusion

One of the most common cancers within the female population is breast carcinoma. Understanding the correlation between nuclear morphometry and diagnosis can lead to an early and accurate diagnosis. Which can lead to early treatment. As shown in this analysis, the radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points all showed a positive correlation with malignancy. As the number grew further from zero, the higher the likelihood of malignancy.

Although all morphometric features showed statistically meaningful correlations, there were only a select few that were marked as good predictors of malignancy as seen in our Binomial Linear Regression model. Overall, the average area and the number of concave points of the cell nuclei were the two good predictors of malignancy in our analysis. The reason being, the average area had a p-value of 0.013 and an effect size of 0.033 and the average number of concave points had a p-value of 0.0043 and an effect size of 66.82. Due to the large effect size and small p-value, the average number of concave points is the best predictor within our analysis.

Opportunities for further analysis could be researching the following:

Looking at the number of concave points as a focal morphometric feature in other types of cancer cells

Citations

Abdalla, Fathi, Jamela Boder, Abdelbaset Buhmeida, Hussein Hashmi, Adem Elzagheid, and YRJÖ COLLAN. 2008. “Nuclear Morphometry in Fnabs of Breast Disease in Libyans.” Anticancer Research 28 (6B): 3985–9.

Narasimha, Aparna, B Vasavi, and Harendra ML Kumar. 2013. “Significance of Nuclear Morphometry in Benign and Malignant Breast Aspirates.” International Journal of Applied and Basic Medical Research 3 (1): 22.

Roskell, Derek E, and Ian D Buley. 2004. “Fine Needle Aspiration Cytology in Cancer Diagnosis.” British Medical Journal Publishing Group.