I will be using Extending the Linear Model with R (Faraway) to explore a problem that has a binary response and discrete explanatory variables.
It can be confusing to navigate 1s and 0s, or values confined to a small range. Visualizations such as boxplots, jitter plots, and interleaved histograms can make these types of datasets much easier to understand.
The data come from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumors, of which 238 are actually malignant. Whether a tumor is really malignant is traditionally determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status.
library(faraway)
library(ggplot2)
df <- wbca # copy for transformations
str(wbca)
## 'data.frame': 681 obs. of 10 variables:
## $ Class: int 1 1 1 1 1 0 1 1 1 1 ...
## $ Adhes: int 1 5 1 1 3 8 1 1 1 1 ...
## $ BNucl: int 1 10 2 4 1 10 10 1 1 1 ...
## $ Chrom: int 3 3 3 3 3 9 3 3 1 2 ...
## $ Epith: int 2 7 2 3 2 7 2 2 2 2 ...
## $ Mitos: int 1 1 1 1 1 1 1 1 5 1 ...
## $ NNucl: int 1 2 1 7 1 7 1 1 1 1 ...
## $ Thick: int 5 5 3 6 4 8 1 2 2 4 ...
## $ UShap: int 1 4 1 8 1 10 1 2 1 1 ...
## $ USize: int 1 4 1 8 1 10 1 1 1 2 ...
We will focus mostly on Class, the binary dependent variable (0 is malignant, 1 is benign), and on BNucl, the independent variable, which takes discrete values describing bare nuclei.
Explain why plot(Class ~ BNucl, wbca) does not work well.
Create a factor version of the response and produce a version of the first panel of Figure 2.1. Comment on the shape of the boxplots.
Produce a version of the second panel of Figure 2.1. What does this plot say about the distribution?
Produce a version of the interleaved histogram shown in Figure 2.2 and comment on the distribution.
graphics::plot(Class ~ BNucl, df)
BNucl takes only discrete values and Class only two, so many observations land on exactly the same coordinates and are drawn on top of one another. Each visible dot may stand for one observation or for many, so this type of plot is not optimal: it gives no sense of how much data sits at each point.
Next we will create a factor version of the response and use a boxplot visualization.
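One way to make the overplotting concrete is to count how many observations share each (BNucl, Class) pair. The sketch below is only an illustration: `overlap_counts()` is a hypothetical helper and `toy` is a small made-up data frame, not values taken from wbca. With the faraway package loaded, the same call could presumably be made on `wbca$BNucl` and `wbca$Class`.

```r
# Hypothetical helper: count how many observations land on each (x, y) point.
overlap_counts <- function(x, y) {
  counts <- as.data.frame(table(x = x, y = y), stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]   # keep only combinations that occur
  counts$x <- as.numeric(counts$x)      # table() stores the values as text labels
  counts$y <- as.numeric(counts$y)
  rownames(counts) <- NULL
  counts
}

# Toy data for illustration only (NOT real wbca values).
toy <- data.frame(
  BNucl = c(1, 1, 1, 1, 2, 10, 10, 10, 5, 1),
  Class = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

pts <- overlap_counts(toy$BNucl, toy$Class)
print(pts)   # one row per distinct point, with its multiplicity in Freq
```

The Freq column is exactly the information the plain scatterplot hides. Passing something like `cex = sqrt(pts$Freq)` to `plot(pts$x, pts$y, ...)` would scale each symbol by its count, which is one possible alternative to jittering.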
# Factor version of the response so that boxplot() groups by class
df$factor.class <- as.factor(df$Class)
graphics::boxplot(BNucl ~ factor.class, data = df)
In the boxplot, the medians (the dark horizontal lines) of the two classes sit far apart, which suggests that BNucl is a good candidate for predicting tumor status: low values go with benign tumors (Class 1) and high values with malignant ones (Class 0).
Next we will use a jitter plot to show the distribution of overlapping data with discrete and binary values. Essentially, a jitter plot takes points that would overlap and adds a small amount of random noise to each one, so that two or more points at the same location become individually visible.
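The numbers behind a boxplot can also be read off directly, which helps when comparing medians by eye is hard. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not wbca data); it relies on `boxplot(..., plot = FALSE)`, which returns the summary statistics instead of drawing them. The labelled factor `status` is an illustrative name, not something from the original code.

```r
# Toy data for illustration only (NOT real wbca values).
toy <- data.frame(
  BNucl = c(1, 1, 1, 2, 3, 10, 10, 8, 5, 10),
  Class = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Label the factor levels so the output is self-describing.
toy$status <- factor(toy$Class, levels = c(0, 1),
                     labels = c("malignant", "benign"))

# plot = FALSE returns the statistics without drawing anything.
bp <- boxplot(BNucl ~ status, data = toy, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge, median,
# upper hinge, upper whisker; one column per group.
five_num <- bp$stats
dimnames(five_num) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  bp$names
)
print(five_num)
```

Using labelled levels instead of a bare 0/1 factor is a design choice that keeps the direction of the coding (0 malignant, 1 benign) visible in every table and plot.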
graphics::plot(jitter(Class, .25) ~ jitter(BNucl), wbca, xlab="BNucl",
ylab="Class (0 if malignant, 1 if benign)", pch=".",col='purple')
The jitter plot shows that most benign tumors (Class 1) have a BNucl value of 1, while a BNucl value of 10 is far more likely to go with a malignant tumor (Class 0). We can see this because the points no longer overlap: jitter() adds a small random offset to each one (here with a factor of 0.25 for Class), so observations that would otherwise share identical coordinates are spread into a visible cloud.
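To see how much noise `jitter()` actually adds, it helps to check the offsets on a small binary vector. This sketch assumes the documented default behaviour of base R's `jitter()` when `amount` is left as NULL: uniform noise of at most `factor/5 * d`, where `d` is the smallest gap between distinct values. The vector `class_vals` is made up for illustration.

```r
set.seed(1)   # make the random noise reproducible

# A binary variable coded 0/1, standing in for Class (illustrative values).
class_vals <- rep(c(0, 1), each = 5)
jittered   <- jitter(class_vals, factor = 0.25)

# With amount = NULL, the noise is uniform on [-factor/5 * d, +factor/5 * d].
# Here the distinct values are 0 and 1, so d = 1.
max_shift <- 0.25 / 5 * 1

print(round(jittered, 3))
print(max(abs(jittered - class_vals)))   # largest offset actually applied
print(max_shift)                         # theoretical bound on the offset
```

So the second argument is a scaling factor rather than the width of the noise itself: under this assumption, a factor of 0.25 on a 0/1 variable moves each point by at most 0.05 in either direction.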
Finally we will use an interleaved histogram to better show the distribution of data.
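# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(df, aes(x = BNucl, color = factor.class)) +
  geom_histogram(position = "dodge", binwidth = 1, aes(y = after_stat(density)))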
We can see from the above that bar heights are a useful way to show how much data falls at each value. With the bars outlined by class (blue for benign, red for malignant) and density on the y-axis, it becomes easy to see which BNucl values tend to go with a benign or a malignant tumor.
A BNucl value of 1 has a density of over 0.75 among benign tumors (blue outline), while values in the upper range of BNucl point to malignant tumors (red outline), where a value of 10 has a density of over 0.5.
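Because the data are integers and the bin width is 1, each bar's density is simply the share of that class falling at that BNucl value, so the same figures can be checked with a table. The sketch below uses a made-up data frame (`toy` and `status` are hypothetical names and values, not wbca data) and base R's `prop.table()`.

```r
# Toy data for illustration only (NOT real wbca values).
toy <- data.frame(
  BNucl = c(1, 1, 1, 2, 3, 10, 10, 8, 5, 10),
  Class = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)
toy$status <- factor(toy$Class, levels = c(0, 1),
                     labels = c("malignant", "benign"))

# Counts of each BNucl value within each class.
counts <- table(status = toy$status, BNucl = toy$BNucl)

# margin = 1 divides by the row totals, so each class sums to 1.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Reading along a row gives the within-class distribution that the interleaved histogram draws as bars, which is handy for quoting exact proportions rather than estimating them from the plot.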