#install.packages("dslabs")
library("dslabs")
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.10
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(dplyr)
#data(package="dslabs")
The data come from digitized images of a fine needle aspirate (FNA) of a breast mass. Each observation is categorized as either “B” for benign or “M” for malignant.
Read in the Breast Cancer Wisconsin Diagnostic Dataset from the UCI Machine Learning Repository (provided here by the dslabs package)
# load the dataset, save it to CSV, then read it back in and assign it to brca
data("brca")
write.csv(brca, "brca.csv", na="")
brca <- read_csv("brca.csv")
## New names:
## Rows: 569 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): y dbl (31): ...1, x.radius_mean, x.texture_mean, x.perimeter_mean,
## x.area_mean...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
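The `...1` column reported above appears because `write.csv()` writes row names as an unnamed first column by default. The sketch below uses a small made-up data frame and temporary files (both are assumptions for illustration, not part of the analysis) to show the effect and how `row.names = FALSE` avoids it. It reads the files back with base R's `read.csv()`, which names the extra column `X` rather than `...1`.

```r
# Made-up rows for illustration only; not taken from the real dataset
toy <- data.frame(
  x.radius_mean = c(13.5, 13.1, 9.5),
  y = c("B", "B", "M")
)

with_rn_file <- tempfile(fileext = ".csv")
without_rn_file <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column
write.csv(toy, with_rn_file, na = "")
# row.names = FALSE leaves them out
write.csv(toy, without_rn_file, na = "", row.names = FALSE)

with_rownames <- read.csv(with_rn_file)
without_rownames <- read.csv(without_rn_file)

names(with_rownames)     # includes the extra index column
names(without_rownames)  # only the original columns
```

Dropping the extra column shifts every later column position by one, so any code that selects columns by position would need to be adjusted to match.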
# look at the first 6 observations
head(brca)
## # A tibble: 6 × 32
## ...1 x.radi…¹ x.tex…² x.per…³ x.are…⁴ x.smo…⁵ x.com…⁶ x.con…⁷ x.con…⁸ x.sym…⁹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 13.5 14.4 87.5 566. 0.0978 0.0813 0.0666 0.0478 0.188
## 2 2 13.1 15.7 85.6 520 0.108 0.127 0.0457 0.0311 0.197
## 3 3 9.50 12.4 60.3 274. 0.102 0.0649 0.0296 0.0208 0.182
## 4 4 13.0 18.4 82.6 524. 0.0898 0.0377 0.0256 0.0292 0.147
## 5 5 8.20 16.8 51.7 202. 0.086 0.0594 0.0159 0.00592 0.177
## 6 6 12.0 14.6 78.0 449. 0.103 0.0909 0.0659 0.0275 0.168
## # … with 22 more variables: x.fractal_dim_mean <dbl>, x.radius_se <dbl>,
## # x.texture_se <dbl>, x.perimeter_se <dbl>, x.area_se <dbl>,
## # x.smoothness_se <dbl>, x.compactness_se <dbl>, x.concavity_se <dbl>,
## # x.concave_pts_se <dbl>, x.symmetry_se <dbl>, x.fractal_dim_se <dbl>,
## # x.radius_worst <dbl>, x.texture_worst <dbl>, x.perimeter_worst <dbl>,
## # x.area_worst <dbl>, x.smoothness_worst <dbl>, x.compactness_worst <dbl>,
## # x.concavity_worst <dbl>, x.concave_pts_worst <dbl>, …
Replace the diagnosis abbreviations with full labels
brca$y[brca$y == "B"] <- "Benign"
brca$y[brca$y == "M"] <- "Malignant"
Add a new numeric column indicating whether a mass is cancerous (0 = benign, not cancerous; 1 = malignant, cancerous)
brca_new <- brca %>%
  # refer to the column directly inside mutate() rather than through brca$y
  mutate(is_cancerous = as.numeric(y == "Malignant"))
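The relabelling and the 0/1 coding can also be expressed with a factor, which keeps the mapping from abbreviation to label in one place. This is a minimal sketch in base R on a made-up vector of diagnosis codes; the names `diagnosis` and `y` here are illustrative and the values are not from the real data.

```r
# Made-up diagnosis codes for illustration only
y <- c("B", "M", "B", "M", "M")

# Map the abbreviations to full labels in a single step
diagnosis <- factor(y, levels = c("B", "M"), labels = c("Benign", "Malignant"))

# Logical comparison converted to 0/1: 1 = malignant, 0 = benign
is_cancerous <- as.integer(diagnosis == "Malignant")

# Cross-tabulate to confirm the coding lines up with the labels
table(diagnosis, is_cancerous)
```

A cross-tabulation like the last line is a quick way to check that every "Malignant" row received a 1 and every "Benign" row a 0.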
Create a dataset that includes only the variables explored (means) and the numeric column “is_cancerous”
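onlyMean <- brca_new %>%
  # select by name rather than by position so the result does not depend on column order
  select(ends_with("_mean"), is_cancerous)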
Create a correlation heatmap between the diagnosis of a mass and the recorded variables (mean radius, mean texture, etc.)
library(DataExplorer)
plot_correlation(onlyMean)
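A heatmap of this kind is built on pairwise correlations, and with a 0/1 column such as is_cancerous the Pearson correlation measures how strongly a measurement separates the two groups. The sketch below ranks variables by the absolute value of that correlation using base R's `cor()`. The data are simulated, and the column names `strong` and `weak` are hypothetical stand-ins for a highly and a weakly related measurement.

```r
set.seed(1)
n <- 100

# Simulated data: 'strong' tracks the 0/1 label, 'weak' is pure noise
is_cancerous <- rep(c(0, 1), each = n / 2)
toy <- data.frame(
  is_cancerous = is_cancerous,
  strong = is_cancerous + rnorm(n, sd = 0.3),
  weak = rnorm(n)
)

# Correlation of every column with the 0/1 diagnosis column
cors <- cor(toy)[, "is_cancerous"]

# Drop the self-correlation and rank by absolute strength
ranked <- sort(abs(cors[names(cors) != "is_cancerous"]), decreasing = TRUE)
ranked
```

Applied to a data frame like onlyMean, the first and last names in such a ranking would correspond to the variables read off the heatmap by eye.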
Create histograms of the two measurements with the highest and lowest correlation with the diagnosis (concave points and fractal dimension), filled by whether the tumor is benign or malignant
# Histogram
hist1 <- brca %>%
  # map mean concave points to the x axis and fill by diagnosis
  ggplot(aes(x = x.concave_pts_mean, fill = y)) +
  geom_histogram(position = "identity", bins = 30) +
  scale_fill_manual(name = "Diagnosis", values = c("Benign" = 'cyan',
                                                   "Malignant" = 'hotpink')) +
  ggtitle("Average Concave points of Masses of\n Benign vs Malignant Tumors") +
  xlab("Average Concave Points") +
  ylab("Frequency") +
  theme_dark()
hist1
hist2 <- brca %>%
  # map mean fractal dimension to the x axis and fill by diagnosis
  ggplot(aes(x = x.fractal_dim_mean, fill = y)) +
  geom_histogram(position = "identity", bins = 30, color = "white") +
  scale_fill_manual(name = "Diagnosis", values = c("Benign" = 'cyan',
                                                   "Malignant" = 'purple')) +
  ggtitle("Average Fractal Dimension of Masses of\n Benign vs Malignant Tumors") +
  xlab("Average Fractal Dimension") +
  ylab("Frequency") +
  theme_classic()
hist2
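With `position = "identity"` the two groups are drawn on top of each other, so fully opaque bars from one group can hide the other where they overlap. One option, sketched here on simulated data rather than the real measurements, is to add some transparency with `alpha`. The data frame `toy` and its columns are assumptions for illustration.

```r
library(ggplot2)

set.seed(42)
# Simulated values for two overlapping groups; not the real dataset
toy <- data.frame(
  value = c(rnorm(200, mean = 0.03, sd = 0.01),
            rnorm(200, mean = 0.09, sd = 0.03)),
  diagnosis = rep(c("Benign", "Malignant"), each = 200)
)

p <- ggplot(toy, aes(x = value, fill = diagnosis)) +
  # alpha < 1 lets bars from both groups show through where they overlap
  geom_histogram(position = "identity", bins = 30, alpha = 0.5) +
  scale_fill_manual(name = "Diagnosis",
                    values = c("Benign" = "cyan", "Malignant" = "hotpink")) +
  xlab("Simulated measurement") +
  ylab("Frequency")
p
```

The same `alpha` argument could be added to the `geom_histogram()` calls above if the overlap between benign and malignant bars matters for the comparison.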
To facet by measurement, I reshape the data from “wide” to “long” format, stacking three variables (texture, smoothness, and area) into a single column. This lets me see whether each is highly correlated with concave points, the measurement most strongly correlated with benign and malignant diagnoses.
brcaLong <- gather(brca_new, key = "measure", value = "value", c("x.texture_mean", "x.smoothness_mean", "x.area_mean"))
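In current versions of tidyr, `gather()` is superseded by `pivot_longer()`, which performs the same wide-to-long reshape with more explicit argument names. The sketch below shows the equivalent call on a two-row made-up data frame; the values are invented for illustration, and only the column names mirror the ones used above.

```r
library(tidyr)

# Two made-up rows with the same column names as the real data
toy <- data.frame(
  y = c("Benign", "Malignant"),
  x.texture_mean = c(14.4, 20.1),
  x.smoothness_mean = c(0.098, 0.110),
  x.area_mean = c(566, 1001)
)

# Stack the three measurement columns into name/value pairs
toy_long <- pivot_longer(
  toy,
  cols = c(x.texture_mean, x.smoothness_mean, x.area_mean),
  names_to = "measure",
  values_to = "value"
)
toy_long
```

Each original row becomes three rows, one per measurement, which is the shape `facet_grid()` needs in order to draw one panel per value of measure.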
One way to display the data is to facet by measurements and to separate by diagnosis.
ggplot(brcaLong, aes(x = value, y = x.concave_pts_mean)) +
  # the y variable is concave points (x.concave_pts_mean), not concavity
  ylab("Concave Points") +
  ggtitle("Effect of Mass Measurements on Concave Points") +
  geom_point(alpha = 1, size = 1.5) +
  geom_smooth(method = "loess", span = 0.15) +
  theme_minimal() +
  facet_grid(y ~ measure, scales = "free")
## `geom_smooth()` using formula 'y ~ x'
This final multivariate graph is not separated into panels by diagnosis. It displays, side by side, the relationships between three measurements (area, smoothness, and texture) and concave points, with diagnosis shown by color.
ggplot(brcaLong, aes(x = value, y = x.concave_pts_mean)) +
  # the y variable is concave points (x.concave_pts_mean), not concavity
  ylab("Concave Points") +
  ggtitle("Effect of Mass Measurements on Concave Points") +
  geom_point(alpha = 1, size = 1, aes(color = y)) +
  geom_smooth(method = "loess", span = 0.15) +
  # apply the complete theme first; a theme() call placed before it would be overridden
  theme_bw() +
  scale_color_manual(name = "Diagnosis", values = c("Benign" = 'lightpink',
                                                    "Malignant" = 'violetred')) +
  facet_grid(~ measure, scales = "free") +
  theme(legend.position = "bottom")
## `geom_smooth()` using formula 'y ~ x'
The dataset I used is the Breast Cancer Wisconsin Diagnostic Dataset from the UCI Machine Learning Repository. It comprises multiple measurements taken from digitized images of the cell nuclei present in a breast mass. For each measurement (radius, texture, smoothness, etc.), the dataset lists the mean, standard error, and worst value. For this assignment, I explored only the means.
To understand how these measurements relate to benign and malignant diagnoses, I first needed to make the diagnosis numeric. I created a column, is_cancerous, set to 0 for benign and 1 for malignant. With this column in place, I created a correlation heatmap, which showed a high correlation between diagnosis and concave points and a low correlation between diagnosis and fractal dimension. To explore these findings further, I created histograms showing the distributions of concave points (highest correlation) and fractal dimension (lowest correlation) by diagnosis. These graphs supported my findings.
Next, I explored the relationship between concave points and three key considerations when diagnosing breast masses: area, smoothness, and texture. I set these measurements as the explanatory variables and concave points as the response variable. This gives an indication of which measurements are associated with concave points and are therefore likely to be more strongly correlated with the diagnosis. To display these measurements side by side (in facets), I needed to change the format of the dataset. I created a new dataset, brcaLong, in which the names of the three explanatory variables fall in a new column, measure, and their values in a second new column, value.
From there, I created two sets of scatterplots. Both display average area, smoothness, and texture against concave points, along with a loess smoothing curve. The first set is split into separate panels by diagnosis, whereas the second displays benign and malignant tumors together, distinguished by color. Overall, my aim was to get a better understanding of breast cancer and of whether specific measurements of masses are good indicators of benign or malignant tumors.
Source information: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)