Load Libraries

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ComplexHeatmap)
library(circlize)

Question 1: Importing the dataset

a) Import the dataset into R, using readr. What is the problem with the 33rd column causing a warning? Fix this problem in the data file, then try again. Import id as a character, and diagnosis as a factor with values B and M.

The 33rd column in the dataset is a logical variable in which all 569 observations are NA. The column's name, “…33”, tells us that it is most likely an empty trailing column rather than real data, and it does not appear to have caused any other problems in the dataset, so we can simply remove it. I removed the column directly from the data file, although it could also have been removed in code (a sketch follows the import below). After removing the column, we import the data with id as a character and diagnosis as a factor.

# Import dataset: ID as a character and Diagnosis as a factor
br_cancer = read_csv("~/OneDrive/CUNY SPH/Spring 2023/Bios 620/Lab 6/cancer.csv",
                     col_types = cols(
                       id = col_character(),
                       diagnosis = col_factor()
                     ))
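
Equivalently, the stray column could have been dropped in code right after import instead of editing the file by hand. A minimal sketch, assuming readr auto-names the empty trailing column "...33":

# Drop the empty trailing column by name, if it is present
# (sketch; assumes the column imported as "...33")
br_cancer = br_cancer %>% select(-any_of("...33"))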

b) Recode the “diagnosis” column to the more informative values “benign” and “malignant”, with reference level “benign”

# recode values using case_match
br_cancer = br_cancer %>% 
  mutate(diagnosis = case_match(diagnosis, 
                                "B" ~ "benign", "M" ~ "malignant",
                                .ptype = factor(levels = c("benign", "malignant"))))
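
A quick sanity check (a sketch) that the recode worked and that benign is the reference level:

# Confirm the new labels and their order; benign should be listed first
levels(br_cancer$diagnosis)
table(br_cancer$diagnosis)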

Question 2: Exploratory Data Analysis (EDA)

# Check dimensions of the data
dim(br_cancer)
## [1] 569  32
# First 10 rows of the data
head(br_cancer, n = 10)
## # A tibble: 10 × 32
##    id       diagnosis radius_mean texture_mean perimeter_mean area_mean
##    <chr>    <fct>           <dbl>        <dbl>          <dbl>     <dbl>
##  1 842302   malignant        18.0         10.4          123.      1001 
##  2 842517   malignant        20.6         17.8          133.      1326 
##  3 84300903 malignant        19.7         21.2          130       1203 
##  4 84348301 malignant        11.4         20.4           77.6      386.
##  5 84358402 malignant        20.3         14.3          135.      1297 
##  6 843786   malignant        12.4         15.7           82.6      477.
##  7 844359   malignant        18.2         20.0          120.      1040 
##  8 84458202 malignant        13.7         20.8           90.2      578.
##  9 844981   malignant        13           21.8           87.5      520.
## 10 84501001 malignant        12.5         24.0           84.0      476.
## # ℹ 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
## #   concavity_mean <dbl>, `concave points_mean` <dbl>, symmetry_mean <dbl>,
## #   fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
## #   perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
## #   compactness_se <dbl>, concavity_se <dbl>, `concave points_se` <dbl>,
## #   symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
## #   texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>, …
# Summarize all variables in the data
summary(br_cancer)
##       id                diagnosis    radius_mean      texture_mean  
##  Length:569         benign   :357   Min.   : 6.981   Min.   : 9.71  
##  Class :character   malignant:212   1st Qu.:11.700   1st Qu.:16.17  
##  Mode  :character                   Median :13.370   Median :18.84  
##                                     Mean   :14.127   Mean   :19.29  
##                                     3rd Qu.:15.780   3rd Qu.:21.80  
##                                     Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.06154   Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se     concave points_se 
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638  
##  Median :0.006380   Median :0.020450   Median :0.02589   Median :0.010930  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189   Mean   :0.011796  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600   Max.   :0.052790  
##   symmetry_se       fractal_dimension_se  radius_worst   texture_worst  
##  Min.   :0.007882   Min.   :0.0008948    Min.   : 7.93   Min.   :12.02  
##  1st Qu.:0.015160   1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08  
##  Median :0.018730   Median :0.0031870    Median :14.97   Median :25.41  
##  Mean   :0.020542   Mean   :0.0037949    Mean   :16.27   Mean   :25.68  
##  3rd Qu.:0.023480   3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72  
##  Max.   :0.078950   Max.   :0.0298400    Max.   :36.04   Max.   :49.54  
##  perimeter_worst    area_worst     smoothness_worst  compactness_worst
##  Min.   : 50.41   Min.   : 185.2   Min.   :0.07117   Min.   :0.02729  
##  1st Qu.: 84.11   1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720  
##  Median : 97.66   Median : 686.5   Median :0.13130   Median :0.21190  
##  Mean   :107.26   Mean   : 880.6   Mean   :0.13237   Mean   :0.25427  
##  3rd Qu.:125.40   3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910  
##  Max.   :251.20   Max.   :4254.0   Max.   :0.22260   Max.   :1.05800  
##  concavity_worst  concave points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750

Based on the outputs, the data contain 569 observations and 32 variables. We can look at the first 10 rows using head() and summarize each variable using summary(). According to the dataset description, ten real-valued features were measured on the cell nuclei in each image, and the remaining columns are derived from them: for each feature there is a mean (e.g., radius_mean), a standard error (radius_se), and a “worst” value (radius_worst), defined as the mean of the three largest values in the image. Because these column groups are computed from the same underlying measurements, we can already anticipate that some variables will be correlated; for example, radius_mean, radius_se, and radius_worst should be related. We do not yet know how strong these correlations are, but we can reasonably guess that they will not be zero.
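
As a quick check, we can compute the pairwise correlations among the three radius variables directly (a sketch):

# Pairwise correlations among the radius summaries
br_cancer %>%
  select(radius_mean, radius_se, radius_worst) %>%
  cor() %>%
  round(2)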

The data description also states that there are no missing values. We can double-check with the following code:

sum(is.na(br_cancer))
## [1] 0

Plot the distance matrix to identify the most highly correlated or anticorrelated variables.

# Distance matrix of correlation coefficients
cor_dist = 1 - abs(br_cancer %>% select(-c("id", "diagnosis")) %>% cor())

# Set up color palette for heatmap
col_fun = colorRamp2(c(0, 1), c("purple", "yellow"))
ComplexHeatmap::Heatmap(cor_dist, col = col_fun)

The diagonal of the matrix has a distance of 0 because every variable is perfectly correlated with itself (\(\rho = 1\)). Some variables, such as area_mean, radius_mean, perimeter_mean, and their corresponding worst values, are highly correlated (their distance is well below 0.5). Most of these strongly related variables sit in the bottom-right section of the heatmap; they measure concavity, area, radius, perimeter, and compactness, among others. Note that because the distance is defined as 1 minus the absolute correlation, strongly correlated and strongly anticorrelated pairs both appear as small distances, while large distances, such as those between texture and the variables measuring size, compactness, and concavity, indicate correlations close to zero.
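
To put numbers on this, we can list the variable pairs with the largest absolute correlations explicitly (a sketch):

# Rank variable pairs by absolute correlation
# (small distance in the heatmap corresponds to large absolute correlation)
cor_mat = br_cancer %>% select(-c("id", "diagnosis")) %>% cor()
as.data.frame(as.table(cor_mat)) %>%
  rename(correlation = Freq) %>%
  filter(as.character(Var1) < as.character(Var2)) %>%  # drop the diagonal and duplicate pairs
  arrange(desc(abs(correlation))) %>%
  head(10)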

Create a box plot for each column except for id. Each boxplot should have two boxes, one for benign and one for malignant, allowing visual comparison of the distribution of that variable for benign and malignant specimens. Do any variables clearly have an association with breast cancer diagnosis?

# Drop the id variable and pivot to long format for faceted boxplots
long_br_cancer = br_cancer %>%
  select(-id) %>%
  pivot_longer(cols = !diagnosis,
               names_to = "variable",
               values_to = "values")

# Create boxplots for all the variables
ggplot(long_br_cancer, aes(x = diagnosis, y = values)) +
  geom_boxplot() +
  facet_wrap( ~ variable, scales = "free") +
  theme_bw() +
  labs(y = "", x = "Diagnosis")

Variables measuring area, concavity, concave points, perimeter, and radius have a clear association with breast cancer diagnosis. The boxplots show that higher values of these variables correspond to a higher likelihood that the mass is malignant.
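
As a numerical companion to the boxplots, we can compare the group medians of a few of these variables directly (a sketch):

# Median of selected variables by diagnosis group
br_cancer %>%
  group_by(diagnosis) %>%
  summarise(across(c(area_mean, radius_mean, perimeter_mean, concavity_mean), median))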

Question 3: Building a Logistic Regression Model

Fit a univariate logistic regression model using area_mean as the predictor and diagnosis as the outcome, using the glm() function

# Logistic regression
log_reg = glm(diagnosis ~ area_mean, data = br_cancer, family = "binomial")

# Summary of model
summary(log_reg)
## 
## Call:
## glm(formula = diagnosis ~ area_mean, family = "binomial", data = br_cancer)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.72957  -0.46420  -0.20420   0.09862   2.73598  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.97409    0.68286  -11.68   <2e-16 ***
## area_mean    0.01177    0.00109   10.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.44  on 568  degrees of freedom
## Residual deviance: 325.66  on 567  degrees of freedom
## AIC: 329.66
## 
## Number of Fisher Scoring iterations: 7

Logistic regression returns coefficients on the log-odds scale, so to interpret the slope we exponentiate it: exp(0.01177) ≈ 1.0118, which is an odds ratio. Each one-unit increase in area_mean multiplies the odds of a malignant diagnosis by about 1.012; since this value is greater than 1, higher values of area_mean are associated with higher odds of malignancy. We can also use the logistic function to compute specific probabilities. For example, what is the probability that a mass is diagnosed as malignant if its area_mean is 100?

\[p(x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}\] \[p(x = 100) = \frac{e^{-7.97409 + 0.01177 \times 100}}{1 + e^{-7.97409 + 0.01177 \times 100}} \approx 0.0011\]
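
Both quantities can also be obtained directly from the fitted model in R (a quick sketch):

# Odds ratio for a one-unit increase in area_mean
exp(coef(log_reg)["area_mean"])

# Predicted probability of malignancy when area_mean = 100
predict(log_reg, newdata = data.frame(area_mean = 100), type = "response")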

Question 4: Making Predictions

a) Create a xy plot with area_mean on the x axis and diagnosis on the y axis. Use geom_jitter(width = 0) to create some spread on the y-axis so you can see where the points are without changing the x values.

area_mean_scplot = ggplot(br_cancer, aes(x = area_mean, y = diagnosis)) +
  geom_jitter(width = 0) +   # jitter only vertically so the x values stay unchanged
  theme_bw() +
  labs(y = "Diagnosis",
       x = "Area Mean")

area_mean_scplot

b) Now add to this plot the predicted probabilities for each observed value of area_mean, using the predict function with argument type=“response”

# Predicted probability of malignancy for each observation
# Shift the probabilities up by 1 so they line up with the factor positions
# on the y-axis (benign = 1, malignant = 2)
br_cancer = br_cancer %>% mutate(prediction = predict(log_reg, type = "response") + 1)

# Add prediction to the graph
area_mean_scplot + 
  geom_line(data = br_cancer, aes(x = area_mean, y = prediction), 
            color = "red", linewidth = 1.5)

c) What do you notice about the data points where the predicted probabilities are 0, 0.5, and 1?

Looking at the graph, we notice that when area_mean is small the predicted probabilities approach zero, and when area_mean is large they approach one. Therefore, according to our model, observations with higher area_mean should be predominantly malignant. We can double-check this by counting how many benign and malignant observations have predicted probabilities greater than 0.90, for example.

# Predicted probabilities are greater than 0.90
table(br_cancer %>% filter((prediction - 1) > 0.90) %>% select(diagnosis))
## diagnosis
##    benign malignant 
##         2       120

We can see that only 2 benign observations were assigned very high predicted probabilities of being malignant, which is expected since those two cases have very large values of area_mean. The opposite should happen when area_mean is small: we would expect mostly benign observations and few malignant ones. We can check by looking at the observations with a predicted probability below 0.10.

# Predicted probabilities are less than 0.10
table(br_cancer %>% filter((prediction - 1) < 0.10) %>% select(diagnosis))
## diagnosis
##    benign malignant 
##       212         9

Only 9 cases with small area_mean were diagnosed as malignant; as expected, observations with small area_mean are predominantly benign. Finally, let's look at the observations with predicted probabilities between 0.45 and 0.55. We filter on a range rather than on an exact probability of 0.50, which would likely match no observations.

# Predicted probabilities around 0.50
table(br_cancer %>% filter((prediction - 1) > 0.45 & (prediction - 1) < 0.55) %>% select(diagnosis))
## diagnosis
##    benign malignant 
##        18         7

Observations with predicted probabilities around 0.50 are more difficult to classify as malignant or benign: an observation with a predicted probability of 0.45 can be benign while another at 0.46 is malignant. Therefore, we cannot draw strong conclusions for these types of observations, and we would need to adjust our model to capture them better.
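
As a rough overall check, we could classify each observation at a 0.50 cutoff and cross-tabulate the result against the true diagnosis (a sketch):

# Confusion table at a 0.50 probability cutoff
# (the stored prediction column was shifted up by 1 for plotting, so subtract it back)
table(predicted = ifelse((br_cancer$prediction - 1) > 0.50, "malignant", "benign"),
      actual = br_cancer$diagnosis)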

d) What are the predicted probabilities of a malignant diagnosis when area_mean is 300, 500, 700, 900, and 1100?

# Create a tibble with the above sequence and predictions
predict_data = tibble(area_mean = seq(300, 1100, 200))

predict_data %>% mutate(prediction = predict(log_reg, newdata = predict_data, type = "response"))
## # A tibble: 5 × 2
##   area_mean prediction
##       <dbl>      <dbl>
## 1       300     0.0116
## 2       500     0.110 
## 3       700     0.565 
## 4       900     0.932 
## 5      1100     0.993
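
These values can also be reproduced by hand from the fitted coefficients, using the logistic function from Question 3 (a sketch):

# Apply the logistic function to the linear predictor at each area_mean value
plogis(coef(log_reg)[1] + coef(log_reg)[2] * seq(300, 1100, by = 200))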