Background

A good friend of yours named Jennifer recently went to the doctor to get a checkup. After the checkup, she tells you something nobody wants to hear from their friend after a doctor’s visit. “They found something.”

The doctor found a tumor in her breast that could be cancerous, but the doctors aren't sure whether it is. If it's cancerous, your friend will have to go through rounds of chemotherapy and pay a mountain of medical bills. There's also a chance that the tumor is harmless.

This diagnosis means everything to your friend!

You remember learning all about logistic regression in your statistics class, and you have great trust in your statistical analysis skills! You decide to build a logistic regression model to predict whether or not her tumor is cancerous! Perhaps you can save her from a lot of unnecessary suffering.


The doctors give you some data that they gleaned from some 3D images of the tumor cells. You find out that her tumor cell nuclei have an area_mean of 511.

Luckily, you found some data on Kaggle: the Breast Cancer Wisconsin (Diagnostic) Data Set, which has the attribute area_mean in it! It contains 10 different measurements of the cell nuclei in 357 benign tumors (not cancerous) and 212 malignant tumors (cancerous).

Ultimately, you are trying to answer the following question:

What is the probability that Jennifer’s tumor is cancerous?

You decide to train a logistic regression model to predict if her tumor is harmless or fatal. Let’s get started!

Analysis

You start by taking a peek at the data. (For simplicity's sake, we'll only use area_mean in this analysis.)

#Bringing in our libraries
library(tidyverse)
library(kableExtra)

#Bringing in our data and cleaning it up a bit
data <- read_csv("C:/Users/zacha/Desktop/Math 325 Notebook/Math 325 Notebook/Data/breast_cancer_diagnostic_data.csv") %>% 
  select(diagnosis, area_mean) %>%
  mutate(diagnosis = ifelse(diagnosis == "M", "Cancerous", "Not Cancerous")) %>%
  as_tibble()

data %>% head()
## # A tibble: 6 x 2
##   diagnosis area_mean
##   <chr>         <dbl>
## 1 Cancerous     1001 
## 2 Cancerous     1326 
## 3 Cancerous     1203 
## 4 Cancerous      386.
## 5 Cancerous     1297 
## 6 Cancerous      477.

Great!
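
Before plotting, we can quickly double-check the class balance described above. This is a minimal sketch using the data we just loaded; it should show 212 Cancerous and 357 Not Cancerous tumors, matching the Kaggle description.

#Counting how many tumors fall in each diagnosis group
data %>% count(diagnosis)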

Let’s start by visualizing the distribution of area mean for both cancerous and not cancerous tumor cells using boxplots. This will help us see if the means are consistently different between cancerous and non-cancerous tumors.

We will also run a quick t-test to compare the means and see what our p-value is. The p-value will show up on the plot above the two boxplots.

library(ggpubr)

data %>%
    ggplot(aes(x = diagnosis, 
               y = area_mean, 
               color = diagnosis,
               fill = diagnosis)) +
        geom_boxplot(alpha=.25) + 
        stat_compare_means(inherit.aes = TRUE, 
                           label = "p.format", 
                            method = "t.test", 
                            label.x = 1.5,
                            show.legend=FALSE) +
        theme_bw() +
        labs(x = "Diagnosis",
             y = "Area Mean",
             title = "Cancerous Tumors Have a Much Higher Cell Nuclei Area Mean")

There's a pretty large gap between the area_mean for cancerous tumors and the area_mean for non-cancerous tumors! We also get an extremely small p-value, which implies that the means are very consistently different.
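
To put a number on that gap, here's a quick sketch that summarizes area_mean within each diagnosis group (the exact values will match whatever the boxplots display):

#Summarizing area_mean for each diagnosis group
data %>%
    group_by(diagnosis) %>%
    summarise(mean_area = mean(area_mean),
              median_area = median(area_mean),
              n = n())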

We can also visualize our cancerous and not cancerous measurements with density plots:

data %>%
    ggplot(aes(x = area_mean,
               color = diagnosis,
               fill = diagnosis)) +
        geom_density(alpha=.25) +
        theme_bw() +
        labs(x = "Area Mean",
             y = "Density",
             title = "Cancerous Tumors Have a Much Higher Cell Nuclei Area Mean")

Insights: both plots tell the same story. Cancerous tumors tend to have a much higher cell nuclei area_mean than non-cancerous tumors, though the two distributions overlap, so area_mean alone won't separate the groups perfectly.

Creating Our Model

To best predict a binary outcome (1 or 0, TRUE or FALSE), we’ll run a logistic regression model.

Here's the mathy part of what we're going to do. The equation for a logistic regression model is as follows:

\[ P(Y_i = 1|x_i) = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}} = \pi_i \] where for observation \(i\):

  • \(Y_i = 1\) denotes that the tumor is cancerous,
  • \(Y_i=0\) denotes that the tumor is not cancerous, and
  • \(x_i\) denotes the area_mean of the cell nuclei in the tumor.
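
Equivalently, the same model can be written on the log-odds (logit) scale, which makes the role of \(\beta_1\) easier to see:

\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i \]

In other words, \(\beta_1\) is the change in the log-odds of a cancerous diagnosis for each one-unit increase in area_mean.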

Note that if \(\beta_1\) is zero in the above model, then \(x_i\) (area_mean) provides no insight about the probability of a tumor being cancerous.

Thus, we could test the hypothesis that

\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]

In simple English: if area_mean truly does help us predict whether a breast tumor is cancerous, its coefficient will show up in our fitted formula as significantly different from 0 (and in this case we expect it to be positive)!

Fitting the Model

Here’s where we fit our model! Remember, our input variable is the area_mean and our output variable is the diagnosis.

Let’s see what happens!

#Creating a column with 1's and 0's instead of "Cancerous" and "Not Cancerous"
data <- data %>% mutate(diagnosis_0_1 = ifelse(diagnosis == "Cancerous", 1, 0))

#Splitting our data into 80% training and 20% testing data sets
library(caret)
training <- createDataPartition(data$diagnosis_0_1, p=0.8, list=FALSE)
train <- data[ training, ]
test <- data[ -training, ]

#fitting our model
breast_cancer_glm <- glm(diagnosis_0_1 ~ area_mean, data = train, family = "binomial")
summary(breast_cancer_glm)
## 
## Call:
## glm(formula = diagnosis_0_1 ~ area_mean, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7563  -0.4715  -0.2117   0.1226   2.7544  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.098550   0.769236 -10.528   <2e-16 ***
## area_mean    0.011969   0.001222   9.793   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 598.00  on 455  degrees of freedom
## Residual deviance: 266.99  on 454  degrees of freedom
## AIC: 270.99
## 
## Number of Fisher Scoring iterations: 6

Just as we found before, the p-value for area_mean is very low (< 2e-16), meaning we can treat it as a statistically significant predictor and use it in our model with confidence.

Visualizing the Model

To visualize this simple logistic regression we could make the following plot.

plot(diagnosis_0_1  ~ area_mean, 
     data=train, 
     main="Logistic Regression Curve - Predicting Breast Cancer", 
     ylab='Probability of Tumor Being Cancerous', 
     pch=16)

curve(
  exp(breast_cancer_glm$coef[1]+breast_cancer_glm$coef[2]*x)/
      (1+exp(breast_cancer_glm$coef[1]+breast_cancer_glm$coef[2]*x)), 
      add=TRUE
      )
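
If you'd rather stay in the ggplot2 workflow we used earlier, here's a sketch of an equivalent plot. Note that geom_smooth() refits the binomial GLM internally rather than reusing breast_cancer_glm, so the curve is computed from scratch but should look essentially identical.

#The same curve drawn with ggplot2: points are the training data, and
#geom_smooth() refits a binomial GLM of diagnosis_0_1 on area_mean
train %>%
    ggplot(aes(x = area_mean, y = diagnosis_0_1)) +
        geom_point(alpha = .4) +
        geom_smooth(method = "glm",
                    method.args = list(family = "binomial"),
                    se = FALSE) +
        theme_bw() +
        labs(x = "Area Mean",
             y = "Probability of Tumor Being Cancerous",
             title = "Logistic Regression Curve - Predicting Breast Cancer")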


Let’s Make Sure The Model is Valid

To check for goodness of fit, we'll use the Hosmer-Lemeshow test (hoslem.test):

#We have very few repeating values, so we will use the hoslem test to check for goodness of fit
library(ResourceSelection)
library(pander)
hoslem.test(breast_cancer_glm$y, breast_cancer_glm$fitted.values) %>% pander()
Hosmer and Lemeshow goodness of fit (GOF) test: breast_cancer_glm$y, breast_cancer_glm$fitted.values

Test statistic   df   P value
--------------   --   -------
         9.401    8    0.3096

Great! Remember that the null hypothesis of the Hosmer-Lemeshow test is that our logistic regression is a good fit for the data. Because the p-value (0.3096) is greater than 0.05, we fail to reject this null and conclude that there is no evidence of a poor fit.

Interpretation

Here’s what the mathematical model for \(\pi_i\) would now look like:

\[ P(Y_i = 1|x_i) \approx \frac {e^{-8.1 + 0.012 x_i} } {1+e^{-8.1 + 0.012 x_i} } = \hat{\pi}_i \]

Where \(b_0 = -8.1\) is the estimated Intercept, which estimates \(\beta_0\), and \(b_1 = 0.012\) is the estimated coefficient on area_mean, which estimates \(\beta_1\). Because the p-value for area_mean in the logistic regression is < 0.05, we can reject our null hypothesis that area_mean doesn't predict the diagnosis. It actually does!

But, by how much?

Well, in our model's case, the y-intercept isn't very meaningful on its own, because none of the area_mean values were 0 (an area of 0 would imply the tumor simply didn't exist). However, the value of \(e^{b_1} = e^{0.012} \approx 1.012\) tells us that the odds of a tumor being cancerous are multiplied by approximately 1.012 (about a 1.2% increase in the odds) for every increase of 1 in area_mean.
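
Here's a quick sketch of how to pull that odds ratio straight out of the fitted model instead of computing it by hand:

#Exponentiating the coefficients converts them from the log-odds scale to odds ratios;
#the area_mean entry should come out to approximately 1.012
exp(coef(breast_cancer_glm))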

How Accurate is our Model?

We can check our model's accuracy by predicting the diagnoses in our training data and seeing how many we got right! (We'll repeat the same check on the held-out test set afterward.)

#Making predictions on the training data
train$pred_prob <- predict(breast_cancer_glm, newdata=train, type='response')
train$pred <- ifelse(train$pred_prob >= .5, 1, 0)

#Comparing our predictions against the actual training-set diagnoses
results <- train %>% 
    mutate(correct = (diagnosis_0_1 == pred)) %>% 
    group_by(correct) %>% 
    tally()

accuracy <- round(results$n[2]/(nrow(train)),3) #Calculating accuracy

print(paste0("We made ", results$n[2], " correct predictions,"))
## [1] "We made 406 correct predictions,"
print(paste0(results$n[1], " incorrect predictions, "))
## [1] "50 incorrect predictions, "
print(paste0("thus giving us an accuracy rating of: ", accuracy*100, "%"))
## [1] "thus giving us an accuracy rating of: 89%"

That’s not too bad for such a simple model!
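
Since we set aside 20% of the data as a test set earlier and never showed it to the model, we can repeat the same check there for a more honest estimate of out-of-sample accuracy. This is just a sketch; the exact number depends on the random split.

#Predicting on the held-out test set and computing the proportion of correct predictions
test$pred_prob <- predict(breast_cancer_glm, newdata = test, type = "response")
test$pred <- ifelse(test$pred_prob >= .5, 1, 0)
mean(test$pred == test$diagnosis_0_1)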

Using the Model to Predict Jennifer’s Diagnosis

So let’s put this model to the test! Recall that the area_mean of Jennifer’s tumor cell nuclei is 511 (not sure what units they were using in the data).

prediction <- predict(breast_cancer_glm, tibble(area_mean = 511), type="response")
(paste0("According to our model, the probability of Jennifer's tumor being cancerous is ", round(prediction[1], 4)*100, "%."))
## [1] "According to our model, the probability of Jennifer's tumor being cancerous is 12.11%."

With such a low probability, you diagnose your friend as “not cancerous”!

You show the doctors your results, and they decide against performing chemotherapy. Your friend thanks you for preventing a treatment that would have put her through a lot of suffering and a mountain of medical bills!