This analysis deals with exploring different morphometric features of cell nuclei and their relationship to diagnosis. The key morphometric features that we are going to focus on are the radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points. Within each section of the analysis, we will explore each feature separately in relation to diagnosis (benign vs malignant) and then compare each feature to each other.
The data for this analysis were collected from a Fine Needle Aspiration Cytology (FNAC). This process includes using a narrow-gauge needle to collect a lesion sample for microscopic examination. (Roskell and Buley 2004) “In symptomatic breast disease, FNAC used alongside clinical and radiological assessment allows rapid, inexpensive, and accurate diagnosis.” (Roskell and Buley 2004) The data represented in this analysis can be found on archive.ics.uci.edu.
Attribute Information:
The Morphometric Features:
The data analysis will be broken down by the following sections:
Radius, Perimeter, and Area to Cell Nuclei Diagnosis
Smoothness and Compactness to Cell Nuclei Diagnosis
Concave Severity & Amount to Cell Nuclei Diagnosis
Logistic Regression - Binomial for all Analyzed Variables in Relation to Diagnosis
One of the most common cancers within the female population is breast carcinoma. Normal cells will transform and the formation of cancer cells will occur. The transformation of these cells can be assessed by looking at their nuclear morphometry. The nuclear morphometric features of size, shape, pattern, etc, have shown to predict the prognosis of breast cancer patients. (Narasimha, Vasavi, and Kumar 2013)
We will evaluate size by their radius, perimeter, and area. As shown in the study conducted by (Narasimha, Vasavi, and Kumar 2013) , “there was a gradual increase in the nuclear area and perimeter in carcinomas when compared to benign lesions.” Therefore, we will assume a positive correlation between size and carcinomas (malignant cell nuclei).
We will begin our research with individual analysis of the radius, perimeter, and area and their relationship to the diagnosis.
The boxplot below shows the average radius of benign and malignant cell nuclei:
kable1 <- data %>%
group_by(diagnosis) %>%
summarise(mean(radius_mean)) # average radius Benign = 12.1 and Malignant = 17.5
knitr::kable(kable1)
| diagnosis | mean(radius_mean) |
|---|---|
| B | 12.14652 |
| M | 17.46283 |
t.test(mean_bc_data$radius_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: mean_bc_data$radius_mean by mean_bc_data$diagnosis
## t = -22.209, df = 289.71, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.787448 -4.845165
## sample estimates:
## mean in group B mean in group M
## 12.14652 17.46283
As seen in the averages, the average radius of benign cell nuclei is 12.1 um and of malignant cell nuclei is 17.5 um. The averages are statistically different from each other as shown in the t-test.
Because our p-value is close to zero, we can reject our null hypothesis above and conclude that there is a statistically significant difference between the two averages.
Next, we will look at the average perimeter of benign and malignant cell nuclei:
kable2<- mean_bc_data %>%
group_by(diagnosis) %>%
summarise(mean(perimeter_mean)) # average perimeter Benign = 78.1 and Malignant = 115
knitr::kable(kable2)
| diagnosis | mean(perimeter_mean) |
|---|---|
| B | 78.07541 |
| M | 115.36538 |
t.test(mean_bc_data$perimeter_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: mean_bc_data$perimeter_mean by mean_bc_data$diagnosis
## t = -22.935, df = 285.41, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -40.49020 -34.08974
## sample estimates:
## mean in group B mean in group M
## 78.07541 115.36538
On average, benign cell nuclei had an average perimeter of 78.08 um and malignant cell nuclei had an average of 115.4 um.
As shown in our t-test, there is a statistically significant difference.
Lastly, we will look at the average area of benign cell nuclei and malignant cell nuclei:
kable3 <- mean_bc_data %>%
group_by(diagnosis) %>%
summarise(mean(area_mean)) # average area Benign = 463 and Malignant = 978
knitr::kable(kable3)
| diagnosis | mean(area_mean) |
|---|---|
| B | 462.7902 |
| M | 978.3764 |
t.test(mean_bc_data$area_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: mean_bc_data$area_mean by mean_bc_data$diagnosis
## t = -19.641, df = 244.79, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -567.2919 -463.8805
## sample estimates:
## mean in group B mean in group M
## 462.7902 978.3764
As seen again, the malignant cell nuclei have larger averages than benign cell nuclei which are statistically significant.
The benign cell nuclei had an average area of 463 um2 and the malignant cell nuclei had an average area of 978 um2.
Let’s visualize all 3 plots side-by-side and record the relationship seen:
radius_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = radius_mean)) +
geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
labs(x = "Diagnosis (Benign vs Malignant)",
y = "Average Radius",
title = "Average Radius of Malignant Cell Nuclei are Larger",
subtitle = "Breast Cancer UCI Data",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
peri_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = perimeter_mean)) +
geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
labs(x = "Diagnosis (Benign vs Malignant)",
y = "Average Perimeter",
title = "Average Perimeter of Malignant Cell Nuclei are Larger",
subtitle = "Breast Cancer UCI Data",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
area_bp <- ggplot(mean_bc_data, aes(x = diagnosis, y = area_mean)) +
geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
labs(x = "Diagnosis (Benign vs Malignant)",
y = "Average Area",
title = "Average Area of Malignant Cell Nuclei are Larger",
subtitle = "Breast Cancer UCI Data",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
multiplot(radius_bp, peri_bp, area_bp, cols = 2) #the above 3 plots plotted on a single graph for visibility purposes
The relationship seen within this side-by-side comparison is that each variable of size (radius, perimeter, and area) shares a positive correlation with malignancy. As the size of each variable increases, there is a higher likelihood of being categorized as malignant.
This shared relationship is not surprising since a greater radius would signify a greater perimeter, which would signify a greater area.
Let’s now look at how the patients are distributed among the different diagnoses:
radius_dp <- ggplot(mean_bc_data, aes(x = radius_mean, fill = diagnosis)) +
geom_density(size = 1, alpha = .5) +
labs(x = "Average Radius",
y = "Density",
title = "Majority of Malignant Cells have an",
subtitle = "Average Radius of between 15 and 20",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
peri_dp <- ggplot(mean_bc_data, aes(x = perimeter_mean, fill = diagnosis)) +
geom_density(size = 1, alpha = .5) +
labs(x = "Average Perimeter",
y = "Density",
title = "Majority of Malignant Cells have an",
subtitle = "Average Perimeter of between 100 and 130",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
area_dp <- ggplot(mean_bc_data, aes(x = area_mean, fill = diagnosis)) +
geom_density(size = 1, alpha = .5) +
labs(x = "Average Area",
y = "Density",
title = "Majority of Malignant Cells have an",
subtitle = "Average Area of between 700 and 1250",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
multiplot(radius_dp, peri_dp, area_dp, cols = 2) #the above 3 plots plotted on a single graph for visibility purposes
The density plot is a great visual aid to see how the patients are distributed depending on their diagnosis. The peaks signify that at that particular x value, a majority of the patients will fall into that bin.
Lastly, let’s look at the relationship of each variable of size (radius, perimeter, and area) to each other.
We will do this with the use of scatter plots as shown below:
rp_sp <- ggplot(mean_bc_data, aes(x = radius_mean, y = perimeter_mean, color = diagnosis)) +
geom_point()+
labs(x = "Average Radius",
y = "Average Perimeter",
title = "Positive Linear Correlation",
subtitle = "Between the Radius and Perimeter",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
ra_sp <- ggplot(mean_bc_data, aes(x = radius_mean, y = area_mean, color = diagnosis)) +
geom_point() +
labs(x = "Average Radius",
y = "Average Area",
title = "Positive Correlation",
subtitle = "Between the Radius and Area",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
pa_sp <- ggplot(mean_bc_data, aes(x = perimeter_mean, y = area_mean, color = diagnosis)) +
geom_point() +
labs(x = "Average Perimeter",
y = "Average Area",
title = "Positive Correlation",
subtitle = "Between the Perimeter and Area",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
multiplot(rp_sp, ra_sp, pa_sp, cols = 2)
An increase in one variable of size (radius, perimeter, or area) will result in an increase of another variable as they all share a positive correlation with one another. We also see a natural shift of malignancy as the size increases in all three measurements of radius, perimeter, and area.
Therefore, we can conclude that large cell nuclei could increase the likelihood of malignancy.
Smoothness is calculated by the local variation in radius lengths within cell nuclei. The closer it is to zero, the less variation and the more smooth the cell nuclei are. The further away it is from zero, the more variability and less smooth the cell nuclei are.
We will visualize the average smoothness between benign and malignant cells:
ggplot(mean_bc_data, aes(x = diagnosis, y = smoothness_mean)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
labs(x = "Diagnosis",
y = "Average Smoothness",
title = "Malignant Cell Nuclei are Less Smooth",
subtitle = "Closer to Zero, Less Variation in Radius Lengths",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
t.test(mean_bc_data$smoothness_mean ~ mean_bc_data$diagnosis, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: mean_bc_data$smoothness_mean by mean_bc_data$diagnosis
## t = -9.2974, df = 466.21, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01262337 -0.00821832
## sample estimates:
## mean in group B mean in group M
## 0.09247765 0.10289849
Compactness is defined by the cell’s ability to be packed together closely.
In this case, the smaller the number, the more compact it is and the larger the number the less compact it is:
ggplot(mean_bc_data, aes(x = diagnosis, y = compactness_mean)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
labs(x = "Diagnosis",
y = "Average Compactness",
title = "Malignant Cell Nuclei are Less Compact",
subtitle = "Closer to Zero, the More Compact",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Malignant cell nuclei are less compact on average than benign cell nuclei. Both the smoothness and compactness seem to categorize diagnosis by the following:
Now, let’s look at the relationship between smoothness and compactness of a cell nuclei:
ggplot(mean_bc_data, aes(x = smoothness_mean, y = compactness_mean, color = diagnosis)) +
geom_point() +
geom_smooth(se = FALSE, size = 2) +
labs(x = "Average Smoothness",
y = "Average Compactness",
title = "Positive Correlation Between Smoothness and Compactness",
subtitle = "Seen in Both Benign and Malignant Cell Nuclei",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Within this section of the analysis, concavity represents the severity of concave points of the contour and the concave points represent the number of concave points of the contour. The further away it is from zero, the more severe the concave points are and the higher the total number of concave points are.
Let’s take a look at the summary data for concavity:
kable4 <- mean_bc_data %>%
summarise(
mean(concavity_mean),
min(concavity_mean),
max(concavity_mean)
)
knitr::kable(kable4)
| mean(concavity_mean) | min(concavity_mean) | max(concavity_mean) |
|---|---|---|
| 0.0887993 | 0 | 0.4268 |
Let’s visualize the relationship between concavity (severity of concave points) to diagnosis:
ggplot(mean_bc_data, aes(x = diagnosis, y = concavity_mean)) +
geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
labs(x = "Diagnosis",
y = "Average Concavity",
title = "Malignant Cell Nuclei are More Severe in Concavity",
subtitle = "Closer to Zero, Less Severity in Concave Points",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Now let’s explore the number of concave points in the same regard as severity:
kable5 <- mean_bc_data %>%
summarise(
mean(concave.points_mean),
min(concave.points_mean),
max(concave.points_mean)
)
knitr::kable(kable5)
| mean(concave.points_mean) | min(concave.points_mean) | max(concave.points_mean) |
|---|---|---|
| 0.0489191 | 0 | 0.2012 |
ggplot(mean_bc_data, aes(x = diagnosis, y = concave.points_mean)) +
geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
labs(x = "Diagnosis",
y = "Average Number of Concave Points",
title = "Malignant Cell Nuclei Have More Concave Points",
subtitle = "Closer to Zero, Less Concave Points",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Overall, the higher the number of concave points around the nuclear border, the higher the likelihood of malignancy.
Both the severity and amount of concave points indicate a positive relationship with the diagnosis.
Therefore, if we plot the above variables, we assume to see a positive relationship as well:
ggplot(mean_bc_data, aes(x = concave.points_mean, y = concavity_mean, color = diagnosis)) +
geom_point(position = "jitter", alha = 1/5) +
geom_smooth(se = FALSE, size = 2) +
geom_vline(xintercept = .0853, linetype = "dashed", color = "blue", size = 1) +
labs(x = "Average Number of Concave Points",
y = "Average Concavity",
title = "Benign Cell Nuclei has a Sudden Breakpoint",
subtitle = "Number of Concave Points can be a Good Predictor of Malignancy",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
This relationship makes sense because the higher the number of concave points, the higher the severity.
There seems to be a breakpoint in benign cancer cells (in terms of concave points) as shown by the blue dashed line
kable6 <- mean_bc_data %>%
filter(diagnosis == "B") %>%
summarise(
mean(concave.points_mean),
max(concave.points_mean) # max number of concave points is .0853. Let's add a vertical line to ggplot above at .09 to show cut off of concave points.
)
knitr::kable(kable6)
| mean(concave.points_mean) | max(concave.points_mean) |
|---|---|
| 0.0257174 | 0.08534 |
With that being said, can we use the number of concave points as a strong indicator/predictor for malignancy within a cell nuclei? Such as that an average greater than .0853 is a good indicator of malignancy? Can this be applied to the population?
Below is a Binomial Logistic Linear Regression model for predicting the diagnosis of the cell nuclei.
The model will utilize the variables used throughout this analysis (radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points) and determine which are good predictors and which are not. Good predictors are indicated by a small p-value and large effect size (Estimate).
predicted <- glm(diagnosis ~ radius_mean + perimeter_mean + area_mean + smoothness_mean + compactness_mean + concavity_mean + concave.points_mean, family = "binomial", data = mean_bc_data)
summary(predicted)
##
## Call:
## glm(formula = diagnosis ~ radius_mean + perimeter_mean + area_mean +
## smoothness_mean + compactness_mean + concavity_mean + concave.points_mean,
## family = "binomial", data = mean_bc_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.33160 -0.24636 -0.12261 0.01534 2.68460
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.77059 8.27776 0.335 0.73785
## radius_mean -3.23813 2.87701 -1.126 0.26037
## perimeter_mean 0.18040 0.39185 0.460 0.64524
## area_mean 0.03258 0.01308 2.491 0.01275 *
## smoothness_mean 27.84861 24.55626 1.134 0.25676
## compactness_mean -8.43463 12.33468 -0.684 0.49409
## concavity_mean 5.22390 7.02471 0.744 0.45709
## concave.points_mean 66.82251 23.39177 2.857 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 198.94 on 561 degrees of freedom
## AIC: 214.94
##
## Number of Fisher Scoring iterations: 8
As seen above there are two variables that are statistically significant (less than 5%):
Average Area
Average Number of Concave Points
Now we will analyze the effect size (Estimate):
Although the average area has a statistically significant p-value, the effect size is relatively small at 0.03258.
However, the average number of concave points has a statistically significant p-value and has a relatively large effect size at 66.82251.
We can conclude from our regression analysis that the average area is a good predictor but the average number of concave points is the best predictor of diagnosis (Benign vs Malignant).
Lastly, we will graph our Regression Model above and ensure it captures the expected binomial relationship between diagnosis and the variables:
probability_data <- data.frame(fitted.values = predicted$fitted.values, status = mean_bc_data$diagnosis)
probability_data <- probability_data %>%
arrange(fitted.values)
probability_data <- probability_data %>%
mutate(rank = 1:nrow(probability_data))
ggplot(probability_data, aes(x = rank, y = fitted.values, color = status)) +
geom_point(alpha = 1, shape = 1, stroke = 2) +
labs(x = "Rank",
y = "Predicted Probability of Malignancy",
title = "Predicted Probability of Malignant Cell Nuclei",
subtitle = "Closer to One, the Higher the Probability of Malignancy",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
One of the most common cancers within the female population is breast carcinoma. Understanding the correlation between nuclear morphometry and diagnosis can lead to an early and accurate diagnosis. Which can lead to early treatment. As shown in this analysis, the radius, perimeter, area, smoothness, compactness, concavity, and the number of concave points all showed a positive correlation with malignancy. As the number grew further from zero, the higher the likelihood of malignancy.
Although all morphometric features showed statistically meaningful correlations, there were only a select few that were marked as good predictors of malignancy as seen in our Binomial Linear Regression model. Overall, the average area and the number of concave points of the cell nuclei were the two good predictors of malignancy in our analysis. The reason being, the average area had a p-value of 0.013 and an effect size of 0.033 and the average number of concave points had a p-value of 0.0043 and an effect size of 66.82. Due to the large effect size and small p-value, the average number of concave points is the best predictor within our analysis.
Opportunities for further analysis could be researching the following:
Abdalla, Fathi, Jamela Boder, Abdelbaset Buhmeida, Hussein Hashmi, Adem Elzagheid, and YRJÖ COLLAN. 2008. “Nuclear Morphometry in Fnabs of Breast Disease in Libyans.” Anticancer Research 28 (6B): 3985–9.
Narasimha, Aparna, B Vasavi, and Harendra ML Kumar. 2013. “Significance of Nuclear Morphometry in Benign and Malignant Breast Aspirates.” International Journal of Applied and Basic Medical Research 3 (1): 22.
Roskell, Derek E, and Ian D Buley. 2004. “Fine Needle Aspiration Cytology in Cancer Diagnosis.” British Medical Journal Publishing Group.