Dataset description

This is a dataset of 10,000 mammograms taken from 4 clinics. Each clinic uses AI tools to estimate the percentage of dense tissue present, which is a major risk factor for developing breast cancer. The radiologist decides which density grade to allocate for the woman, (a, b, c or d). These grades are used amongst other things to allocate further screening. The X-Ray dose is also given.

Getting familiar with the dataset

Load packages:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(rstatix)

## 
## Attaching package: 'rstatix'
## 
## The following object is masked from 'package:stats':
## 
##     filter

Load the data into a dataframe:

df <- read_csv("./data/test_dataset.csv")

## Rows: 10000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): patient_id, grade, clinic, radiologist, machine
## dbl (2): volumetric_breast_density %, dose_per_image
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df <- as.data.frame(df)

Check for missing data: there is no missing data in the dataset

any(is.na(df))

## [1] FALSE

Check if patient IDs are unique: the number of unique patient IDs is the same as the number of rows in the dataset, so all patient IDs are unique.

length(unique(df$patient_id))

## [1] 10000

Check summary statistics for numerical variables:

summary(df)

##   patient_id        volumetric_breast_density %    grade          
##  Length:10000       Min.   : 0.06061            Length:10000      
##  Class :character   1st Qu.: 3.44984            Class :character  
##  Mode  :character   Median : 5.88529            Mode  :character  
##                     Mean   : 6.99473                              
##                     3rd Qu.: 9.43167                              
##                     Max.   :40.39965                              
##     clinic          radiologist          machine          dose_per_image 
##  Length:10000       Length:10000       Length:10000       Min.   :1.154  
##  Class :character   Class :character   Class :character   1st Qu.:1.481  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.579  
##                                                           Mean   :1.657  
##                                                           3rd Qu.:1.688  
##                                                           Max.   :4.192

Get an idea of how many unique radiologist and machines we are dealing with:

table(df$radiologist)

## 
##  City Specialists 0  City Specialists 1  City Specialists 2  City Specialists 3 
##                1012                1020                 988                1007 
##        FastScreen 0        FastScreen 1        FastScreen 2        FastScreen 3 
##                 722                 723                 768                 801 
##  Hospital Imaging 0  Hospital Imaging 1  Hospital Imaging 2  Hospital Imaging 3 
##                 477                 528                 498                 480 
## Smalltown Imaging 0 Smalltown Imaging 1 Smalltown Imaging 2 Smalltown Imaging 3 
##                 229                 244                 244                 259

table(df$machine)

## 
##  City Specialists 0  City Specialists 1  City Specialists 2  City Specialists 3 
##                 958                1043                1010                1016 
##        FastScreen 0        FastScreen 1        FastScreen 2        FastScreen 3 
##                 751                 748                 752                 763 
##  Hospital Imaging 0  Hospital Imaging 1  Hospital Imaging 2  Hospital Imaging 3 
##                 518                 483                 509                 473 
## Smalltown Imaging 0 Smalltown Imaging 1 Smalltown Imaging 2 Smalltown Imaging 3 
##                 223                 231                 247                 275

Plot the volumetric breast density values and other variables in the dataset using boxplots and histograms:

ggplot(df, aes(x = grade, y = `volumetric_breast_density %`, fill = grade)) +
  geom_boxplot() +
  facet_wrap(~ clinic) +
  labs(
    x = "Assigned Grade",
    y = "Volumetric Breast Density %",
    title = "Volumetric Breast Density % by Grade and Clinic"
  )

The distribution of Volumetric Breast Density % values assigned grade d by City Specialists looks different to other clinics and overlaps values typically assigned lower grades (a-c). Closer inspection revealed Volumetric Breast Density % values between 1 and 2 assigned grade d

ggplot(df, aes(x=`volumetric_breast_density %`)) + 
  geom_histogram(binwidth=1, fill="lightgreen", color="#e9ecef", alpha=0.9) +
  facet_wrap(~ clinic) +
  labs(
    x = "Volumetric Breast Density % Frequencies",
    title = "Volumetric Breast Density % Frequencies by Clinic"
  )

The frequencies of Volumetric Breast Density % per clinic do not follow a normal distribution and the distribution looks different for one of the clinics (Smalltown Imaging) indicating possible differences in equipment calibration or patient population.

A visual test for normality for volumetric_breast_density %:

qqnorm(df$`volumetric_breast_density %`)
qqline(df$`volumetric_breast_density %`, col = "red")

The data points deviate from the red line indicating non-normal distribution

Assignment of grades between different clinics

Do the assigned grades look the same between the clinics, or different?

I’ve chosen the non-parametric Kruskal-Wallis test suitable for non-normally distributed data to check if there is a statistically significant difference in the assignment of grades between clinics

grade_a_data <- subset(df, grade == "a")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_a_data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 278.42, df = 3, p-value < 2.2e-16

Higher chi-squared value indicates a greater divergence between median volumetric_breast_density % assigned grade a by different clinics P-value < 0.05 indicates that the differences are significant

grade_b_data <- subset(df, grade == "b")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_b_data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 291.11, df = 3, p-value < 2.2e-16

grade_c_data <- subset(df, grade == "c")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_c_data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 38.096, df = 3, p-value = 2.697e-08

grade_d_data <- subset(df, grade == "d")
kruskal.test(`volumetric_breast_density %` ~ clinic, data = grade_d_data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  volumetric_breast_density % by clinic
## Kruskal-Wallis chi-squared = 176.75, df = 3, p-value < 2.2e-16

The Kruskal-Wallis test revealed statistically significant differences in the way clinics assign grades. The magnitude of these differences, from largest to smallest, was observed for grade b, followed by grades a, d, and c.

X-ray doses

Based only on the data, do any of the machines use unusually high X-Ray doses?

Plot volumetric breast density values by grade and machine with the colour scale indicating the x-ray dose:

ggplot(df, aes(x = grade, y = `volumetric_breast_density %`, color = dose_per_image)) +
  geom_jitter(width = 0.3, alpha = 0.4) +
  facet_wrap(~ machine)

Points in the plot for the “Hospital Imaging 0” machine correspond to higher x-ray doses and this may indicate a problem with the machine

Flag up machines wich use high X-ray doses

Check if dose_per_image is normally distributed to determine appropriate statistical test:

qqnorm(df$dose_per_image)
qqline(df$dose_per_image, col = "red")

The data points deviate from the red line indicating non-normal distribution

I’ve chosen a non-parametric test test to check if there are significant differences in the medians of dose_per_image values between different machines:

kruskal.test(dose_per_image ~ machine, data = df)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  dose_per_image by machine
## Kruskal-Wallis chi-squared = 1487.9, df = 15, p-value < 2.2e-16

The results indicate that significant differences exist in median dose_per_image values between different machines.

The following steps are done to flag machines with outlier median dose_per_image values.

Calculate the median dose_per_image for each machine

Q4_summary <- df %>%
  group_by(machine) %>%
  summarize(n = length(dose_per_image),
            DPI_median = round(median(dose_per_image),2))
median_dose_per_machine <- as.data.frame(Q4_summary)

Use Median Absolute Deviation to work out the lower and upper threshold for median dose_per_image beyond which the median will be considered an outlier

# median of dose_per_image medians per machine
med <- median(median_dose_per_machine$DPI_median)
# the constant value can be adjuested based on the data 
mad_value <- mad(median_dose_per_machine$DPI_median, constant = 5)

#the Median Absolute Deviation is multiplied by 3 to work out the lower and upper thresholds
lower <- med - 3 * mad_value
upper <- med + 3 * mad_value
outliers <- median_dose_per_machine[median_dose_per_machine$DPI_median < lower | median_dose_per_machine$DPI_median > upper, ]
print(outliers)

##              machine   n DPI_median
## 9 Hospital Imaging 0 518       3.15

Machines listed in the outliers dataframe may need to be checked. In this example it’s Hospital Imaging 0.

Grade assignment by individual radiologists

The “FastScreen” clinic wants to know how their radiologists are performing compared to each other. What can you say to them given the data, and what can’t you say?

Plot the distribution of grading by clinic and radiologist:

ggplot(df, aes(x = grade, y = `volumetric_breast_density %`, color = radiologist)) +
  geom_boxplot() +
  facet_wrap(~ clinic)

The box plots of grades assigned by radiologist “FastScreen2” look different to the grading by other radiologists at that clinic.

Check if the difference is significant for individual grades:

FastScreen_data <- subset(df, clinic == "FastScreen")
FastScreen_grade_a_data <- subset(FastScreen_data, grade == "a")
kruskal.test(`volumetric_breast_density %` ~ radiologist, data = FastScreen_grade_a_data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  volumetric_breast_density % by radiologist
## Kruskal-Wallis chi-squared = 65.654, df = 3, p-value = 3.634e-14

The results indicate a significant difference between radiologists for grade a.

Use Dunn’s Test for post hoc analysis to check which radiologist’s data differ from the other radiologists:

dunn_test(`volumetric_breast_density %` ~ radiologist, data = FastScreen_grade_a_data, p.adjust.method = "bonferroni")

## # A tibble: 6 × 9
##   .y.         group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>       <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 volumetric… FastS… FastS…   159   188     0.401 6.88e- 1 1   e+ 0 ns          
## 2 volumetric… FastS… FastS…   159   149     6.55  5.59e-11 3.35e-10 ****        
## 3 volumetric… FastS… FastS…   159   220    -0.251 8.02e- 1 1   e+ 0 ns          
## 4 volumetric… FastS… FastS…   188   149     6.42  1.37e-10 8.20e-10 ****        
## 5 volumetric… FastS… FastS…   188   220    -0.698 4.85e- 1 1   e+ 0 ns          
## 6 volumetric… FastS… FastS…   149   220    -7.29  3.10e-13 1.86e-12 ****

There is a significant difference in how radiologist FastScreen 2 assigns grade a compared to the other radiologists.

Grades b, c and d can be tested using the same method

I can’t comment on the quality of the grading or reasons for the differences.

Sample statistical data analysis