This is a dataset of 10,000 mammograms taken from 4 clinics. Each clinic uses AI tools to estimate the percentage of dense tissue present, which is a major risk factor for developing breast cancer. The radiologist decides which density grade to allocate for the woman, (a, b, c or d). These grades are used amongst other things to allocate further screening. The X-Ray dose is also given.
Load packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(rstatix)
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:stats':
##
## filter
Load the data into a dataframe:
df <- read_csv("./data/test_dataset.csv")
## Rows: 10000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): patient_id, grade, clinic, radiologist, machine
## dbl (2): volumetric_breast_density %, dose_per_image
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- as.data.frame(df)
Check for missing data: there is no missing data in the dataset
any(is.na(df))
## [1] FALSE
Check if patient IDs are unique: the number of unique patient IDs is the same as the number of rows in the dataset, so all patient IDs are unique.
length(unique(df$patient_id))
## [1] 10000
Check summary statistics for numerical variables:
summary(df)
## patient_id volumetric_breast_density % grade
## Length:10000 Min. : 0.06061 Length:10000
## Class :character 1st Qu.: 3.44984 Class :character
## Mode :character Median : 5.88529 Mode :character
## Mean : 6.99473
## 3rd Qu.: 9.43167
## Max. :40.39965
## clinic radiologist machine dose_per_image
## Length:10000 Length:10000 Length:10000 Min. :1.154
## Class :character Class :character Class :character 1st Qu.:1.481
## Mode :character Mode :character Mode :character Median :1.579
## Mean :1.657
## 3rd Qu.:1.688
## Max. :4.192
Get an idea of how many unique radiologist and machines we are dealing with:
table(df$radiologist)
##
## City Specialists 0 City Specialists 1 City Specialists 2 City Specialists 3
## 1012 1020 988 1007
## FastScreen 0 FastScreen 1 FastScreen 2 FastScreen 3
## 722 723 768 801
## Hospital Imaging 0 Hospital Imaging 1 Hospital Imaging 2 Hospital Imaging 3
## 477 528 498 480
## Smalltown Imaging 0 Smalltown Imaging 1 Smalltown Imaging 2 Smalltown Imaging 3
## 229 244 244 259
table(df$machine)
##
## City Specialists 0 City Specialists 1 City Specialists 2 City Specialists 3
## 958 1043 1010 1016
## FastScreen 0 FastScreen 1 FastScreen 2 FastScreen 3
## 751 748 752 763
## Hospital Imaging 0 Hospital Imaging 1 Hospital Imaging 2 Hospital Imaging 3
## 518 483 509 473
## Smalltown Imaging 0 Smalltown Imaging 1 Smalltown Imaging 2 Smalltown Imaging 3
## 223 231 247 275
Plot the volumetric breast density values and other variables in the dataset using boxplots and histograms:
ggplot(df, aes(x = grade, y = `volumetric_breast_density %`, fill = grade)) +
geom_boxplot() +
facet_wrap(~ clinic) +
labs(
x = "Assigned Grade",
y = "Volumetric Breast Density %",
title = "Volumetric Breast Density % by Grade and Clinic"
)
The distribution of Volumetric Breast Density % values assigned grade d by City Specialists looks different to other clinics and overlaps values typically assigned lower grades (a-c). Closer inspection revealed Volumetric Breast Density % values between 1 and 2 assigned grade d
ggplot(df, aes(x=`volumetric_breast_density %`)) +
geom_histogram(binwidth=1, fill="lightgreen", color="#e9ecef", alpha=0.9) +
facet_wrap(~ clinic) +
labs(
x = "Volumetric Breast Density % Frequencies",
title = "Volumetric Breast Density % Frequencies by Clinic"
)
The frequencies of Volumetric Breast Density % per clinic do not follow a normal distribution and the distribution looks different for one of the clinics (Smalltown Imaging) indicating possible differences in equipment calibration or patient population.
A visual test for normality for volumetric_breast_density %:
qqnorm(df$`volumetric_breast_density %`)
qqline(df$`volumetric_breast_density %`, col = "red")
The data points deviate from the red line indicating non-normal distribution
Do the assigned grades look the same between the clinics, or different?
I’ve chosen the non-parametric Kruskal-Wallis test suitable for non-normally distributed data to check if there is a statistically significant difference in the assignment of grades between clinics
grade_a_data <- subset(df, grade == "a")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_a_data)
##
## Kruskal-Wallis rank sum test
##
## data: volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 278.42, df = 3, p-value < 2.2e-16
Higher chi-squared value indicates a greater divergence between
median volumetric_breast_density % assigned grade a by
different clinics P-value < 0.05 indicates that the differences are
significant
grade_b_data <- subset(df, grade == "b")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_b_data)
##
## Kruskal-Wallis rank sum test
##
## data: volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 291.11, df = 3, p-value < 2.2e-16
grade_c_data <- subset(df, grade == "c")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_c_data)
##
## Kruskal-Wallis rank sum test
##
## data: volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 38.096, df = 3, p-value = 2.697e-08
grade_d_data <- subset(df, grade == "d")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_d_data)
##
## Kruskal-Wallis rank sum test
##
## data: volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 176.75, df = 3, p-value < 2.2e-16
The Kruskal-Wallis test revealed statistically significant differences in the way clinics assign grades. The magnitude of these differences, from largest to smallest, was observed for grade b, followed by grades a, d, and c.
Based only on the data, do any of the machines use unusually high X-Ray doses?
Plot volumetric breast density values by grade and machine with the colour scale indicating the x-ray dose:
ggplot(df, aes(x = grade, y = `volumetric_breast_density %`, color = dose_per_image)) +
geom_jitter(width = 0.3, alpha = 0.4) +
facet_wrap(~ machine)
Points in the plot for the “Hospital Imaging 0” machine correspond to higher x-ray doses and this may indicate a problem with the machine
Check if dose_per_image is normally distributed to determine appropriate statistical test:
qqnorm(df$dose_per_image)
qqline(df$dose_per_image, col = "red")
The data points deviate from the red line indicating non-normal distribution
I’ve chosen a non-parametric test test to check if there are significant differences in the medians of dose_per_image values between different machines:
kruskal.test(dose_per_image ~ machine, data = df)
##
## Kruskal-Wallis rank sum test
##
## data: dose_per_image by machine
## Kruskal-Wallis chi-squared = 1487.9, df = 15, p-value < 2.2e-16
The results indicate that significant differences exist in median dose_per_image values between different machines.
The following steps are done to flag machines with outlier median dose_per_image values.
Calculate the median dose_per_image for each machine
Q4_summary <- df %>%
group_by(machine) %>%
summarize(n = length(dose_per_image),
DPI_median = round(median(dose_per_image),2))
median_dose_per_machine <- as.data.frame(Q4_summary)
Use Median Absolute Deviation to work out the lower and upper threshold for median dose_per_image beyond which the median will be considered an outlier
# median of dose_per_image medians per machine
med <- median(median_dose_per_machine$DPI_median)
# the constant value can be adjuested based on the data
mad_value <- mad(median_dose_per_machine$DPI_median, constant = 5)
#the Median Absolute Deviation is multiplied by 3 to work out the lower and upper thresholds
lower <- med - 3 * mad_value
upper <- med + 3 * mad_value
outliers <- median_dose_per_machine[median_dose_per_machine$DPI_median < lower | median_dose_per_machine$DPI_median > upper, ]
print(outliers)
## machine n DPI_median
## 9 Hospital Imaging 0 518 3.15
Machines listed in the outliers dataframe may need to be checked. In this example it’s Hospital Imaging 0.
The “FastScreen” clinic wants to know how their radiologists are performing compared to each other. What can you say to them given the data, and what can’t you say?
Plot the distribution of grading by clinic and radiologist:
ggplot(df, aes(x = grade, y = `volumetric_breast_density %`, color = radiologist)) +
geom_boxplot() +
facet_wrap(~ clinic)
The box plots of grades assigned by radiologist “FastScreen2” look different to the grading by other radiologists at that clinic.
Check if the difference is significant for individual grades:
FastScreen_data <- subset(df, clinic == "FastScreen")
FastScreen_grade_a_data <- subset(FastScreen_data, grade == "a")
kruskal.test(`volumetric_breast_density %` ~ radiologist, data = FastScreen_grade_a_data)
##
## Kruskal-Wallis rank sum test
##
## data: volumetric_breast_density % by radiologist
## Kruskal-Wallis chi-squared = 65.654, df = 3, p-value = 3.634e-14
The results indicate a significant difference between radiologists for grade a.
Use Dunn’s Test for post hoc analysis to check which radiologist’s data differ from the other radiologists:
dunn_test(`volumetric_breast_density %` ~ radiologist, data = FastScreen_grade_a_data, p.adjust.method = "bonferroni")
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 volumetric… FastS… FastS… 159 188 0.401 6.88e- 1 1 e+ 0 ns
## 2 volumetric… FastS… FastS… 159 149 6.55 5.59e-11 3.35e-10 ****
## 3 volumetric… FastS… FastS… 159 220 -0.251 8.02e- 1 1 e+ 0 ns
## 4 volumetric… FastS… FastS… 188 149 6.42 1.37e-10 8.20e-10 ****
## 5 volumetric… FastS… FastS… 188 220 -0.698 4.85e- 1 1 e+ 0 ns
## 6 volumetric… FastS… FastS… 149 220 -7.29 3.10e-13 1.86e-12 ****
There is a significant difference in how radiologist FastScreen 2 assigns grade a compared to the other radiologists.
Grades b, c and d can be tested using the same method
I can’t comment on the quality of the grading or reasons for the differences.