library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ComplexHeatmap)
library(circlize)
a) Import the dataset into R, using readr. What
is the problem with the 33rd column causing a warning? Fix this problem
in the data file, then try again. Import id as a character, and
diagnosis as a factor with values B and M.
The 33rd column in the dataset is a logical variable with all 568
observations containing NAs. We can see that the column’s
name is “…33”. This tells us that it was most likely a mistake and it
doesn’t seem like this error caused other errors in the dataset.
Therefore we can simply remove this column. I removed the column
directly from the datafile. However, we could have also removed the
column using code. After removing the column, we import the data with
id as a character and diagnosis as a
factor.
# Import dataset: ID as a character and Diagnosis as a factor
br_cancer = read_csv("~/OneDrive/CUNY SPH/Spring 2023/Bios 620/Lab 6/cancer.csv",
col_types = cols(
id = col_character(),
diagnosis = col_factor()
))
b) Recode the “diagnosis” column to the more informative values “benign” and “malignant”, with reference level “benign”
# recode values using case_match
br_cancer = br_cancer %>%
mutate(diagnosis = case_match(diagnosis,
"B" ~ "benign", "M" ~ "malignant",
.ptype = factor(levels = c("benign", "malignant"))))
# Check dimensions of the data
dim(br_cancer)
## [1] 569 32
# First 10 rows of the data
head(br_cancer, n = 10)
## # A tibble: 10 × 32
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## <chr> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 842302 malignant 18.0 10.4 123. 1001
## 2 842517 malignant 20.6 17.8 133. 1326
## 3 84300903 malignant 19.7 21.2 130 1203
## 4 84348301 malignant 11.4 20.4 77.6 386.
## 5 84358402 malignant 20.3 14.3 135. 1297
## 6 843786 malignant 12.4 15.7 82.6 477.
## 7 844359 malignant 18.2 20.0 120. 1040
## 8 84458202 malignant 13.7 20.8 90.2 578.
## 9 844981 malignant 13 21.8 87.5 520.
## 10 84501001 malignant 12.5 24.0 84.0 476.
## # ℹ 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
## # concavity_mean <dbl>, `concave points_mean` <dbl>, symmetry_mean <dbl>,
## # fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
## # perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
## # compactness_se <dbl>, concavity_se <dbl>, `concave points_se` <dbl>,
## # symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
## # texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>, …
# Summarize all variables in the data
summary(br_cancer)
## id diagnosis radius_mean texture_mean
## Length:569 benign :357 Min. : 6.981 Min. : 9.71
## Class :character malignant:212 1st Qu.:11.700 1st Qu.:16.17
## Mode :character Median :13.370 Median :18.84
## Mean :14.127 Mean :19.29
## 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
Based on the outputs, we can see that there are 569 observations and
32 variables in the data. We can look at the first 10 rows of
observations by using head() and summarize each variable
using summary(). Based on the description of the data set,
there are ten real-value features and the rest are computed based on
these variables. For example, the radius_mean is one
variable, radius_se is a variable that contains the
standard error, and the radius_worst contains the “worst”
or largest (mean of the three largest values) for each of the images. In
other words, each participant had multiple images of a mass taken and
these values represent the mean, standard error, and worst values for
those images. Therefore, we can already see that some variables will be
correlated since they are computed using the same set of numbers. For
example, radius_mean, radius_se, and
radius_worst will be correlated. We do not know how large
their correlation is, however, we can already make an educated guess
that it will be not be zero.
The data description also states that there are no missing values. We can double check by using the following piece of code:
sum(is.na(br_cancer))
## [1] 0
Plot the distance matrix to identify the most highly correlated or anticorrelated variables.
# Distance matrix of correlation coefficients
cor_dist = 1 - abs(br_cancer %>% select(-c("id", "diagnosis")) %>% cor())
# Set up color palette for heatmap
col_fun = colorRamp2(c(0, 1), c("purple", "yellow"))
ComplexHeatmap::Heatmap(cor_dist, col = col_fun)
The diagonal of the matrix will have a distance of 0, meaning they
will be perfectly correlated. This is because a variable will always be
perfectly correlated with itself (\(\rho =
1\)). Some variables such as the area_mean,
radius_mean, perimeter_mean, and their
corresponding worst values are highly correlated (their distance is less
than 0.5). We can see that most variables that are correlated are on the
bottom right section of the heatmap. These variables measured concavity,
area, radius, perimeter, compactness, among others. We can also see how
values such as texture are anticorrelated with variables that measure
size, compactness, and concavity of the mass.
Create a box plot for each column except for id.
Each boxplot should have two boxes, one for benign and one for
malignant, allowing visual comparison of the distribution of that
variable for benign and malignant specimens. Do any variables clearly
have an association with breast cancer diagnosis?
# Creating a new dataset without the id variable
long_br_cancer = br_cancer %>%
select(-id) %>%
pivot_longer(cols = !diagnosis,
names_to = "variable",
values_to = "values")
# Create boxplots for all the variables
ggplot(long_br_cancer, aes(x = diagnosis, y = values)) +
geom_boxplot() +
facet_wrap( ~ variable, scales = "free") +
theme_bw() +
labs(y = "", x = "Diagnosis")
Variables that deal with area, concavity, concave points, perimeter, and radius have a clear association with diagnosis of breast cancer. The boxplots clearly show that higher values for these variables suggests a higher probability that these masses are malignant.
Fit a univariate logistic regression model using
area_mean as the predictor and diagnosis as
the outcome, using the glm() function
# Logistic regression
log_reg = glm(diagnosis ~ area_mean, data = br_cancer, family = "binomial")
# Summary of model
summary(log_reg)
##
## Call:
## glm(formula = diagnosis ~ area_mean, family = "binomial", data = br_cancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.72957 -0.46420 -0.20420 0.09862 2.73598
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.97409 0.68286 -11.68 <2e-16 ***
## area_mean 0.01177 0.00109 10.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 325.66 on 567 degrees of freedom
## AIC: 329.66
##
## Number of Fisher Scoring iterations: 7
Logistic regression provides us with the log odds of the outcome. In
order to get the odds, we need to exponentiate the coefficient.
Therefore, the odds are 1.0118. Since the odds are greater than 1, it
suggests that the odds of it being a malignant mass is higher for higher
values of area_mean. We can use the definition of the
logistic function to find specific probabilities. For example, what is
the probability of the mass being diagnosed as malignant if the area
mean is 100?
\[p(x) = \frac{e^{\hat{\beta_0} + \hat{\beta_1X}}}{1 + e^{\hat{\beta_0} + \hat{\beta_1X}}}\] \[p(x = 100) = \frac{e^{-7.97409 + 0.01177 * 100}}{1 + e^{-7.97409 + 0.01177 * 100}} = 0.0011\]
a) Create a xy plot with area_mean on the x axis and diagnosis on the y axis. Use geom_jitter(width = 0) to create some spread on the y-axis so you can see where the points are without changing the x values.
area_mean_scplot = ggplot(br_cancer, aes(x = area_mean, y = diagnosis)) +
geom_point() +
geom_jitter(width = 0) +
theme_bw() +
labs(y = "Diagnosis",
x = "Area Mean")
area_mean_scplot
b) Now add to this plot the predicted probabilities for each observed temperature, using the predict function with argument type=“response”
# Creating prediction probabilities for each observation
# Add 1 to the probabilities to make them to the same scale
br_cancer = br_cancer %>% mutate(prediction = predict(log_reg, type = "response") + 1)
# Add prediction to the graph
area_mean_scplot +
geom_line(data = br_cancer, aes(x = area_mean, y = prediction),
color = "red", linewidth = 1.5)
c) What do you notice about the data points where the predicted probabilities are 0, 0.5, and 1?
Looking at the graph, we notice that when area_mean is
small, the probabilities approach zero. However, when the
area_mean is higher, the probabilities approach one.
Therefore, we would expect observations with higher
area_mean to be predominantly malignant according to our
model. We can double check this by looking at the number of observations
that are benign or malignant when their predicted probabilities are
greater than 0.90, for example.
# Predicted probabilities are greater than 0.90
table(br_cancer %>% filter((prediction - 1) > 0.90) %>% select(diagnosis))
## diagnosis
## benign malignant
## 2 120
We can see that only 2 benign observations were predicted very high
probabilities of being malignant. This is expected since those two cases
have a very high area_mean. We would expect the opposite to
happen when the area_mean is smaller. We would expect more
observations with benign diagnosis than malignant. We can check the
observations with a predicted probability of less than 0.10
# Predicted probabilities are less than 0.10
table(br_cancer %>% filter((prediction - 1) < 0.10) %>% select(diagnosis))
## diagnosis
## benign malignant
## 212 9
Only 9 cases were diagnosed as malignant with small
area_mean. As expected, observations were small
area_mean are predominantly benign. Let’s see at those
observations with predicted probabilities between 0.45 and 0.55. We do
not want to check observations with a exact probability of 0.50 since it
might yield no observations.
# Predicted probabilities around 0.50
table(br_cancer %>% filter((prediction - 1) > 0.45 & (prediction - 1) < 0.55) %>% select(diagnosis))
## diagnosis
## benign malignant
## 18 7
Observations with probabilities around 0.50 are more difficult to diagnose as malignant or benign. An observation with a predicted probaility of 0.45 is benign whereas another one with a predicted probability of 0.46 is malignant. Therefore, we can’t draw too many conclusions for these type of observations and we would need to adjust our model to better capture these type of observations.
d) What are the predicted probabilities of a malignant diagnosis when area_mean is 300, 500, 700, 900, and 1100?
# Create a tibble with the above sequence and predictions
predict_data = tibble(area_mean = seq(300, 1100, 200))
predict_data %>% mutate(prediction = predict(log_reg, newdata = predict_data,type = "response"))
## # A tibble: 5 × 2
## area_mean prediction
## <dbl> <dbl>
## 1 300 0.0116
## 2 500 0.110
## 3 700 0.565
## 4 900 0.932
## 5 1100 0.993