03_EDA_datasets_itu

Author

Sergio Uribe

Modified

June 14, 2024

Packages

Datasets

Data cleaning

How many datasets?

[1] 16 62

From which databases?

Database	n	percent
Kaggle	3	18.8%
Github	2	12.5%
Google Datasets	2	12.5%
Mendeley	2	12.5%
PubMed	2	12.5%
Zenodo	2	12.5%
Grand-challenge	1	6.2%
OSF	1	6.2%
arXiv	1	6.2%

Year of dataset publication

Year.of.dataset.publication	n	percent
2020	1	6.2%
2021	2	12.5%
2022	6	37.5%
2023	7	43.8%

Associated with publication?

Paper.associated	n	percent
No	5	31.2%
Yes	11	68.8%

Areas of research

Value	n	percent
Teeth segmentation	10	34.5%
Teeth labeling	9	31.0%
Caries	3	10.3%
Oral Pathology	3	10.3%
Endodontics	2	6.9%
Cephalometric	1	3.4%
Oral Surgery	1	3.4%
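
Because a dataset can list several research areas, the rows above count area mentions rather than datasets, so the percentages use the total number of mentions as the denominator. The following is a minimal base-R sketch of that counting logic, not the report's own pipeline; the `areas` vector is hypothetical illustration data. It also shows why trimming matters: without `trimws()`, a value such as " Endodontics" would be counted separately from "Endodontics".

```r
# Hypothetical multi-choice answers, one string per dataset (not the study data)
areas <- c("Teeth segmentation, Teeth labeling",
           "Caries, Endodontics",
           "Endodontics")

# Split on commas, then strip the whitespace left behind by the split
mentions <- trimws(unlist(strsplit(areas, ",")))

# Count mentions per area, most frequent first
counts <- sort(table(mentions), decreasing = TRUE)

# Percentages are relative to the number of mentions, not the number of datasets
area_table <- data.frame(
  Value   = names(counts),
  n       = as.vector(counts),
  percent = sprintf("%.1f%%", 100 * as.vector(counts) / sum(counts))
)
print(area_table)
```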

Imaging modality

Value	n	percent
Panoramic radiographs	10	58.8%
Cone Beam Computed Tomography (CBCT)	2	11.8%
Intraoral photograph	2	11.8%
Cephalometric radiographs	1	5.9%
Intra-oral 3D scans	1	5.9%
Intraoral radiographs	1	5.9%

Images amount analysis

name	sum	median	average	sd
Images...CBCT	557	278.5	278.5	156.3
Images...Intraoral.radiographs	757	188.0	252.3	241.0
Images...Other	3781	925.0	945.2	731.6
Images...Panoramic	5355	180.0	595.0	790.1

Patients per dataset and country

Imaging.modality..multiple.choices.	n	mean_patients	sd_patients
Cephalometric radiographs	1	NA	NA
Cone Beam Computed Tomography (CBCT)	2	NA	259.5082
Intra-oral 3D scans	1	NA	NA
Intraoral photograph	2	NA	NA
Panoramic radiographs	6	NA	444.1859

MAP

Distribution by country

How many countries?

value	n	percent
CHN	4	22.2%
IRN	2	11.1%
USA	2	11.1%
BEL	1	5.6%
CHE	1	5.6%
ESP	1	5.6%
FRA	1	5.6%
IND	1	5.6%
KOR	1	5.6%
PRY	1	5.6%
SAU	1	5.6%
TUN	1	5.6%
TWN	1	5.6%

Country	n	percent
China	4	22.2%
Iran	2	11.1%
United States	2	11.1%
Belgium	1	5.6%
Switzerland	1	5.6%
Spain	1	5.6%
France	1	5.6%
India	1	5.6%
South Korea	1	5.6%
Paraguay	1	5.6%
Saudi Arabia	1	5.6%
Tunisia	1	5.6%
Taiwan	1	5.6%

By numbers of images

country	n
China	2413
Switzerland	2332
Belgium	1800
France	1800
Iran	1504
United States	1117
Taiwan	600
Tunisia	180
Paraguay	135
India	131
Saudi Arabia	50
South Korea	50
Spain	50

Number of images per repository source

Database	n	images	sd_images	img_per_dataset
Kaggle	3	2653	652.5	884.3
Github	2	263	79.9	131.5
Google Datasets	2	1037	567.8	518.5
Mendeley	2	412	36.8	206.0
PubMed	2	1050	671.8	525.0
Zenodo	2	2467	1553.5	1233.5
Grand-challenge	1	600	NA	600.0
OSF	1	1800	NA	1800.0
arXiv	1	168	NA	168.0

Table of images, datasets per imaging modality

name	n	sum_images	sd_images
Images...CBCT	2	557	156.3
Images...Intraoral.radiographs	3	757	241.0
Images...Other	4	3781	731.6
Images...Panoramic	9	5355	790.1

Metadata analysis

name	Percentage Yes
Contain annotations	75.0
Associated to paper	68.8
Type of image processing	68.8
Annotation tool reported	66.7
Anatomic segmentation	62.5
Patients number	62.5
Ground truth method explanation	56.2
Annotators experience reported	53.8
Ground truth definition	50.0
Image processing	50.0
Lesion segmentation	50.0
Anonymization strategy	43.8
Equipment used	43.8
Ethical approval	31.2
Patient inclusion/exclusion criteria	31.2
Annotators calibration	18.8
Patient sex distribution	18.8
Annotator dispute handling	16.7
Annotators training	15.4
Annotator age reporting	7.7
Patient consent	6.2
Calibration metric reported	0.0
Patient ethnicity	0.0

Colors for yes/no

FAIR ANALYSIS

name	n	mean	sd
Findable	16	56.2	27.6
Accessible	16	60.4	37.0
Interoperable	16	50.0	37.6
Reusable	16	41.9	19.7
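
The four FAIR components are scored on different scales (7, 3, 4 and 10 points, per the column suffixes `max.7`, `max.3`, `max.4` and `max.10`), so the table above reports each one as a percentage of its maximum. A minimal base-R sketch of that rescaling follows; the `scores` data frame holds made-up values for three imaginary datasets and is an assumption for illustration only.

```r
# Hypothetical raw FAIR sub-scores for three datasets (illustrative values only)
scores <- data.frame(
  Findable.max.7      = c(7, 3.5, 0),
  Accessible.max.3    = c(3, 1.5, 0),
  Interoperable.max.4 = c(4, 2, 0),
  Reusable.max.10     = c(10, 5, 0)
)

# Maximum attainable points per component, taken from the column suffixes
max_points <- c(Findable = 7, Accessible = 3, Interoperable = 4, Reusable = 10)

# Divide each column by its maximum and express the result as a percentage
pct <- sweep(as.matrix(scores), 2, max_points, "/") * 100
colnames(pct) <- names(max_points)

# Mean and SD per component, rounded to one decimal as in the report
fair_summary <- round(rbind(mean = colMeans(pct), sd = apply(pct, 2, sd)), 1)
print(fair_summary)
```

Rescaling first makes the components comparable; averaging the raw points would let the 10-point Reusable score dominate the 3-point Accessible score.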

Final Table

Imaging	n	images
Cephalometric radiographs	1	600
Cone Beam Computed Tomography (CBCT)	2	1088
Intraoral 3D Scans or images	4	4981
Intraoral radiographs	3	150
Panoramic radiographs	9	6263

Imaging	n
Cephalometric radiographs	1
Cone Beam Computed Tomography (CBCT)	2
Intraoral 3D Scans or images	4
Intraoral radiographs	3
Panoramic radiographs	9

name	n	sum	mean	sd
Images...CBCT	2	557	278.5	156.3
Images...Intraoral.radiographs	3	757	252.3	241.0
Images...Panoramic	9	5355	595.0	790.1

Imaging	BEL	CHE	CHN	ESP	FRA	IND	IRN	KOR	PRY	SAU	TUN	TWN	USA
Cephalometric radiographs	0	0	0	0	0	0	0	0	0	0	0	1	0
Cone Beam Computed Tomography (CBCT)	0	0	2	0	0	0	0	0	0	0	0	0	0
Intraoral 3D Scans or images	1	0	1	0	1	1	0	0	0	0	0	0	0
Intraoral radiographs	0	0	0	1	0	0	0	1	0	1	0	0	0
Panoramic radiographs	0	1	2	0	0	0	2	0	1	0	1	0	2

Imaging	Accessible.max.3	FAIRness	Findable.max.7	Interoperable.max.4	Reusable.max.10
Other	1.8	45.8	3.5	1.6	4.2
Cone Beam Computed Tomography (CBCT)	2.5	65.0	4.8	3.5	5.0
Intraoral radiographs	2.0	75.0	7.0	3.0	6.0
Panoramic radiographs	1.7	48.4	4.1	2.1	3.9

---
title: "03_EDA_datasets_itu"
author: "Sergio Uribe"
date-modified: last-modified
format:
  html:
    toc: true
    toc-expand: 3
    code-fold: true
    code-tools: true
editor: visual
execute:
  echo: false
  cache: false
  warning: false
  message: false
---

# Packages

```{r}
# Load required libraries with pacman; installs them if not already installed
pacman::p_load(tidyverse,   # tools for data science
               visdat,      # NAs
               janitor,     # for data cleaning and tables
               here,        # for reproducible research
               gtsummary,   # for tables
               maps,
               patchwork,
               viridis,
               scales,
               countrycode  # to normalize country data
               )
```

```{r}
theme_set(theme_minimal())
```

# Datasets

```{r}
df <- read.csv(here("data", "df.csv"))
```

```{r}
df_long <- read.csv(here("data", "df_long.csv"))
```

# Data cleaning

```{r}
# Only dataset with image number
# df |>
#   filter(Is.the.Number.of.images.in.the.dataset.reported. == "Yes")
```

## How many datasets?

```{r}
dim(df)
```

## From which databases?

```{r}
df |>
  tabyl(Database) |>
  adorn_pct_formatting() |>
  arrange(desc(n)) |>
  knitr::kable()
```

```{r}
df |>
  tabyl(Database) |>
  ggplot(aes(x = fct_reorder(Database, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Datasets by Database", x = "")
```

## Year of dataset publication

```{r}
df |>
  tabyl(Year.of.dataset.publication) |>
  adorn_pct_formatting() |>
  knitr::kable()
```

```{r}
df |>
  ggplot(aes(x = Year.of.dataset.publication)) +
  geom_bar() +
  labs(title = "Year of dataset publication", x = "Year", y = "n")
```

## Associated with publication?

```{r}
df |>
  tabyl(Paper.associated) |>
  adorn_pct_formatting() |>
  knitr::kable()
```

## Areas of research

```{r}
df |>
  pivot_longer(cols = starts_with("Areas"), names_to = "Area", values_to = "Value") |>
  separate_rows(Value, sep = ",") |>
  filter(!is.na(Value)) |>
  select(Value) |>
  # trim whitespace left by the split so each area is counted once
  mutate(Value = str_trim(Value, side = "both")) |>
  tabyl(Value) |>
  adorn_pct_formatting() |>
  arrange(desc(n)) |>
  knitr::kable()
```

```{r}
df |>
  pivot_longer(cols = starts_with("Areas"), names_to = "Area", values_to = "Value") |>
  separate_rows(Value, sep = ",") |>
  filter(!is.na(Value)) |>
  select(Value) |>
  # trim whitespace left by the split so each area is counted once
  mutate(Value = str_trim(Value, side = "both")) |>
  tabyl(Value) |>
  ggplot(aes(x = fct_reorder(Value, n), y = n)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Distribution of Research Areas",
       x = "Research Area",
       y = "Count",
       caption = "Each dataset can be in more than one area") +
  coord_flip() +
  scale_y_continuous(breaks = pretty_breaks())
```

## Imaging modality

```{r}
df |>
  pivot_longer(cols = starts_with("Imaging"), names_to = "Imaging", values_to = "Value") |>
  separate_rows(Value, sep = ",") |>
  filter(!is.na(Value)) |>
  select(Value) |>
  mutate(Value = str_trim(Value, side = "both")) |>
  tabyl(Value) |>
  adorn_pct_formatting() |>
  arrange(desc(n)) |>
  knitr::kable()
```

```{r}
df |>
  pivot_longer(cols = starts_with("Imaging"), names_to = "Imaging", values_to = "Value") |>
  separate_rows(Value, sep = ",") |>
  filter(!is.na(Value)) |>
  select(Value) |>
  mutate(Value = str_trim(Value, side = "both")) |>
  tabyl(Value) |>
  ggplot(aes(x = fct_reorder(Value, n), y = n)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Distribution of Dataset by Imaging Modalities",
       x = "Imaging Modality",
       y = "Count",
       caption = "Each dataset can be in more than one area") +
  coord_flip() +
  scale_y_continuous(breaks = pretty_breaks())
```

## Images amount analysis

```{r}
df |>
  select(Response.ID, Images...Panoramic:Images...Other) |>
  pivot_longer(-Response.ID) |>
  group_by(name) |>
  filter(!is.na(value)) |>
  summarise(sum = sum(value),
            median = median(value),
            average = mean(value),
            sd = sd(value)) |>
  mutate(across(where(is.numeric), round, 1)) |>
  knitr::kable()
```

```{r}
df |>
  select(Response.ID, Images...Panoramic:Images...Other) |>
  pivot_longer(-Response.ID) |>
  mutate(name = str_replace_all(name, "Images...", "")) |>
  ggplot(aes(x = name, y = value)) +
  geom_col()
```

## Patients per dataset and country

```{r}
df |>
  separate_rows(Imaging.modality..multiple.choices., sep = ", ") |>
  group_by(Imaging.modality..multiple.choices.) |>
  filter(!is.na(Number.of.patients.in.the.dataset)) |>
  summarise(n = n(),
            mean_patients = mean(Number.of.patients.in.the.dataset, na.rm = TRUE),
            sd_patients = sd(Number.of.patients.in.the.dataset, na.rm = TRUE)) |>
  knitr::kable()
```

## MAP

### Distribution by country

How many countries?

```{r}
df |>
  select(Response.ID, Number.of.images.in.the.dataset, country_1, country_2, country_3) |>
  pivot_longer(-c(Response.ID, Number.of.images.in.the.dataset)) |>
  filter(!is.na(value)) |>
  tabyl(value) |>
  arrange(desc(n)) |>
  adorn_pct_formatting() |>
  knitr::kable()
```

```{r}
df |>
  pivot_longer(cols = starts_with("country")) |>
  filter(!is.na(value)) |>
  tabyl(value) |>
  adorn_pct_formatting() |>
  arrange(desc(n)) |>
  rename(Country = value) |>
  mutate(Country = countrycode(Country, "iso3c", "country.name")) |>
  knitr::kable()
```

```{r}
countries <- df |>
  pivot_longer(cols = starts_with("country")) |>
  filter(!is.na(value)) |>
  mutate(value = countrycode(value, "iso3c", "country.name")) |>
  tabyl(value) |>
  adorn_pct_formatting() |>
  arrange(desc(n)) |>
  rename(country = value)
```

```{r}
# Load world map data
world_map <- map_data("world")

world_map <- world_map |>
  mutate(region = countrycode(region, "country.name", "country.name"))
```

```{r}
merged_data <- world_map |>
  left_join(countries, by = c("region" = "country"))
```

```{r}
merged_data |>
  ggplot() +
  geom_polygon(aes(
    x = long,
    y = lat,
    group = group,
    fill = n
  ), color = "Grey 80") +
  scale_fill_viridis_c(option = "plasma", na.value = "Grey 97", direction = -1) +
  # coord_sf(crs= "+proj=cea +lon_0=0 +x_0=0 +y_0=0 +lat_ts=45 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") +
  theme_minimal() +
  labs(fill = "N",
       title = "A. Dataset Distribution by Country",
       caption = "Some datasets are associated with multiple countries")
```

```{r}
map_by_dataset <- merged_data |>
  ggplot() +
  geom_polygon(aes(
    x = long,
    y = lat,
    group = group,
    fill = n
  ), color = "Grey 80") +
  scale_fill_viridis_c(option = "plasma", na.value = "Grey 97", direction = -1) +
  theme_minimal() +
  labs(fill = "N",
       title = "A. Dataset Distribution by Country",
       caption = "Some datasets are associated with multiple countries")
```

### By numbers of images

```{r}
df |>
  select(starts_with("country"), Number.of.images.in.the.dataset) |>
  pivot_longer(-Number.of.images.in.the.dataset) |>
  filter(!is.na(value)) |>
  rename(n = Number.of.images.in.the.dataset) |>
  select(-name) |>
  mutate(country = countrycode(value, "iso3c", "country.name")) |>
  select(-value) |>
  group_by(country) |>
  summarise(n = sum(n)) |>
  arrange(desc(n)) |>
  knitr::kable()
```

```{r}
countries <- df |>
  select(starts_with("country"), Number.of.images.in.the.dataset) |>
  pivot_longer(-Number.of.images.in.the.dataset) |>
  filter(!is.na(value)) |>
  rename(n = Number.of.images.in.the.dataset) |>
  select(-name) |>
  mutate(country = countrycode(value, "iso3c", "country.name")) |>
  select(-value) |>
  group_by(country) |>
  summarise(n = sum(n)) |>
  arrange(desc(n))
```

```{r}
merged_data <- world_map |>
  left_join(countries, by = c("region" = "country"))
```

```{r}
merged_data |>
  ggplot() +
  geom_polygon(aes(
    x = long,
    y = lat,
    group = group,
    fill = n
  ), color = "Grey 80") +
  scale_fill_viridis_c(option = "plasma", na.value = "Grey 97", direction = -1) +
  theme_minimal() +
  labs(fill = "N",
       title = "B. Image Count per Country",
       caption = "Some datasets are associated with multiple countries")
```

```{r}
map_by_images <- merged_data |>
  ggplot() +
  geom_polygon(aes(
    x = long,
    y = lat,
    group = group,
    fill = n
  ), color = "Grey 80") +
  scale_fill_viridis_c(option = "plasma", na.value = "Grey 97", direction = -1) +
  theme_minimal() +
  labs(fill = "N",
       title = "B. Image Count per Country",
       caption = "Some datasets are associated with multiple countries")
```

```{r}
map_by_dataset / map_by_images
```

```{r}
ggsave(here("figures", "Fig2_map.pdf"), dpi = 300, height = 30, width = 25, units = c("cm"))
```

```{r}
rm(merged_data, world_map, countries, map_by_images, map_by_dataset)
```

## Number of images per repository source

```{r}
df |>
  group_by(Database) |>
  summarise(n = n(),
            images = sum(Number.of.images.in.the.dataset),
            sd_images = sd(Number.of.images.in.the.dataset)) |>
  arrange(desc(n)) |>
  mutate("img_per_dataset" = images / n) |>
  mutate(across(where(is.numeric), round, 1)) |>
  knitr::kable()
```

## Table of images, datasets per imaging modality

```{r}
df |>
  select(Response.ID, contains("Images...")) |>
  pivot_longer(-Response.ID) |>
  filter(!is.na(value)) |>
  group_by(name) |>
  summarise(n = n(),
            sum_images = sum(value),
            sd_images = sd(value)) |>
  mutate(across(where(is.numeric), round, 1)) |>
  knitr::kable()
```

## Metadata analysis

```{r}
df |>
  # select relevant yes no columns
  select(Response.ID,
         Paper.associated,
         reporting...Ethical.approval.for.dataset.publication.:reporting...Image.acquisition.device..e.g..Sirona..Germany..,
         reporting...Image.processing.,
         reporting...Gender.ratio..males.females..,
         reporting...Ethnicity.,
         Does.the.dataset.include.annotations.,
         Is.the.calibration.of.training.of.the.annotators.described.:Is.the.Number.of.patients.in.the.dataset.reported.) |>
  # remove unwanted columns
  select(-Annotation.Software, How.was.the.ground.truth...gold.standard.established.in.the.study.) |>
  # relevel if the ground truth was annotated
  mutate(How.was.the.ground.truth...gold.standard.established.in.the.study. = if_else(
    How.was.the.ground.truth...gold.standard.established.in.the.study. == "Not described", "No", "Yes")) |>
  pivot_longer(-Response.ID) |>
  filter(!is.na(value)) |>
  mutate(value = fct_collapse(value, "No" = c("Not specified", "Not sure"))) |>
  mutate(name = recode(name,
    "reporting...Ground.truth.or.gold.standard.method.described." = "Ground truth method explanation",
    "Does.the.dataset.include.annotations." = "Contain annotations",
    "How.was.the.ground.truth...gold.standard.established.in.the.study." = "Ground truth definition",
    "Is.the.Number.of.patients.in.the.dataset.reported." = "Patients number",
    "Is.the.calibration.of.training.of.the.annotators.described." = "Annotators calibration",
    "Paper.associated" = "Associated to paper",
    "annotators..Is.any.metric.related.to.the.calibration.of.annotators.reported..kappa..ICC..etc..." = "Calibration metric reported",
    "annotators..Is.described.the.calibration.or.training.of.the.annotators.." = "Annotators training",
    "annotators..Is.the.age.of.annotators.reported.." = "Annotator age reporting",
    "annotators..Is.the.experience.or.qualifications.of.the.annotators.described.." = "Annotators experience reported",
    "annotators..Is.the.reporting.of.mechanisms.strategies.to.deal.with.disagreements.included.in.the.study.." = "Annotator dispute handling",
    "annotators..Is.the.software.used.for.annotations.described.in.the.study.." = "Annotation tool reported",
    "reporting...Anonymisation.strategy." = "Anonymization strategy",
    "reporting...Ethical.approval.for.dataset.publication." = "Ethical approval",
    "reporting...Ethnicity." = "Patient ethnicity",
    "reporting...Gender.ratio..males.females.." = "Patient sex distribution",
    "reporting...Image.acquisition.device..e.g..Sirona..Germany.." = "Equipment used",
    "reporting...Image.processing." = "Image processing",
    "reporting...Image.processing.or.adjustment." = "Type of image processing",
    "reporting...Inclusion.or.exclusion.criteria.stated." = "Patient inclusion/exclusion criteria",
    "reporting...Lesion.feature.or.image.size.annotations." = "Lesion segmentation",
    "reporting...Participant.consent." = "Patient consent",
    "reporting...Segmentations." = "Anatomic segmentation"
  )) |>
  # calculate the % yes
  group_by(name) |>
  summarize(
    `Percentage Yes` = mean(value == "Yes") * 100,
    .groups = 'drop'
  ) |>
  mutate(across(where(is.numeric), round, 1)) |>
  arrange(desc(`Percentage Yes`)) |>
  knitr::kable()
```

Colors for yes/no

```{r}
# Define colors with distinct luminance values
# colors <- c("No" = "#F8766D", "Yes" = "#00BFC4") # Base
colors <- c("No" = "#F8766D", "Yes" = "#009498")
```

```{r}
df |>
  # select relevant yes no columns
  select(Response.ID,
         Paper.associated,
         reporting...Ethical.approval.for.dataset.publication.:reporting...Image.acquisition.device..e.g..Sirona..Germany..,
         reporting...Image.processing.,
         reporting...Gender.ratio..males.females..,
         reporting...Ethnicity.,
         Does.the.dataset.include.annotations.,
         Is.the.calibration.of.training.of.the.annotators.described.:Is.the.Number.of.patients.in.the.dataset.reported.) |>
  # remove unwanted columns
  select(-Annotation.Software, How.was.the.ground.truth...gold.standard.established.in.the.study.) |>
  # relevel if the ground truth was annotated
  mutate(How.was.the.ground.truth...gold.standard.established.in.the.study. = if_else(
    How.was.the.ground.truth...gold.standard.established.in.the.study. == "Not described", "No", "Yes")) |>
  pivot_longer(-Response.ID) |>
  filter(!is.na(value)) |>
  mutate(value = fct_collapse(value, "No" = c("Not specified", "Not sure"))) |>
  mutate(name = recode(name,
    "reporting...Ground.truth.or.gold.standard.method.described." = "Ground truth method explanation",
    "Does.the.dataset.include.annotations." = "Contain annotations",
    "How.was.the.ground.truth...gold.standard.established.in.the.study." = "Ground truth definition",
    "Is.the.Number.of.patients.in.the.dataset.reported." = "Patients number",
    "Is.the.calibration.of.training.of.the.annotators.described." = "Annotators calibration",
    "Paper.associated" = "Associated to paper",
    "annotators..Is.any.metric.related.to.the.calibration.of.annotators.reported..kappa..ICC..etc..." = "Calibration metric reported",
    "annotators..Is.described.the.calibration.or.training.of.the.annotators.." = "Annotators training",
    "annotators..Is.the.age.of.annotators.reported.." = "Annotator age reporting",
    "annotators..Is.the.experience.or.qualifications.of.the.annotators.described.." = "Annotators experience reported",
    "annotators..Is.the.reporting.of.mechanisms.strategies.to.deal.with.disagreements.included.in.the.study.." = "Annotator dispute handling",
    "annotators..Is.the.software.used.for.annotations.described.in.the.study.." = "Annotation tool reported",
    "reporting...Anonymisation.strategy." = "Anonymization strategy",
    "reporting...Ethical.approval.for.dataset.publication." = "Ethical approval",
    "reporting...Ethnicity." = "Patient ethnicity",
    "reporting...Gender.ratio..males.females.." = "Patient sex distribution",
    "reporting...Image.acquisition.device..e.g..Sirona..Germany.." = "Equipment used",
    "reporting...Image.processing." = "Image processing",
    "reporting...Image.processing.or.adjustment." = "Type of image processing",
    "reporting...Inclusion.or.exclusion.criteria.stated." = "Patient inclusion/exclusion criteria",
    "reporting...Lesion.feature.or.image.size.annotations." = "Lesion segmentation",
    "reporting...Participant.consent." = "Patient consent",
    "reporting...Segmentations." = "Anatomic segmentation"
  )) |>
  ggplot(aes(
    x = fct_reorder(name, value, .fun = function(x) mean(x == "Yes")),
    fill = value
  )) +
  # after_stat(prop) replaces the deprecated ..prop.. notation
  geom_bar(position = "fill", aes(y = after_stat(prop), group = value)) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Question",
       y = "Percentage",
       fill = "Answer",
       title = "Metadata Completeness of Dental Imaging Datasets for AI") +
  theme_minimal() +
  # theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "lightgrey") +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "lightgrey") +
  geom_hline(yintercept = 0.75, linetype = "dashed", color = "lightgrey") +
  coord_flip() +
  scale_fill_manual(values = colors)
# scale_fill_viridis_d(option = "viridis", direction = -1)
# scale_fill_grey(start = 0.8, end = 0.2)
```

```{r}
ggsave(here("figures", "Fig3_metadata.pdf"), dpi = 300, width = 18, height = 15, units = c("cm"))
```

## FAIR ANALYSIS

```{r}
df |>
  select(Response.ID, FAIRness:Reusable.max.10) |>
  # rescale each component to a percentage of its maximum score
  mutate("Findable" = Findable.max.7 / 7 * 100,
         "Accessible" = Accessible.max.3 / 3 * 100,
         "Interoperable" = Interoperable.max.4 / 4 * 100,
         "Reusable" = Reusable.max.10 / 10 * 100) |>
  select(Response.ID, FAIRness.level, Findable:Reusable) |>
  pivot_longer(-c(Response.ID, FAIRness.level)) |>
  mutate(name = fct_relevel(name, "Findable", "Accessible", "Interoperable", "Reusable")) |>
  group_by(name) |>
  summarise(n = n(), mean = mean(value), sd = sd(value)) |>
  mutate(across(where(is.numeric), round, 1)) |>
  knitr::kable()
```

```{r}
df |>
  select(Response.ID, FAIRness:Reusable.max.10) |>
  # rescale each component to a percentage of its maximum score
  mutate("Findable" = Findable.max.7 / 7 * 100,
         "Accessible" = Accessible.max.3 / 3 * 100,
         "Interoperable" = Interoperable.max.4 / 4 * 100,
         "Reusable" = Reusable.max.10 / 10 * 100) |>
  select(Response.ID, FAIRness.level, Findable:Reusable) |>
  pivot_longer(-c(Response.ID, FAIRness.level)) |>
  mutate(name = fct_relevel(name, "Findable", "Accessible", "Interoperable", "Reusable")) |>
  group_by(name) |>
  ggplot(aes(x = name, y = value)) +
  geom_boxplot(width = .3, alpha = .9) +
  labs(title = "FAIRness of the datasets", x = "", y = "Percentage")
```

# Final Table

```{r}
df_short <- df |>
  select(Imaging = Imaging.modality..multiple.choices.,
         Images = Number.of.images.in.the.dataset,
         Patients = Number.of.patients.in.the.dataset,
         country_1, country_2, country_3,
         Response.ID,
         FAIRness, FAIRness.level,
         Findable.max.7, Accessible.max.3, Interoperable.max.4, Reusable.max.10) |>
  # separate the imaging modality
  separate_rows(Imaging, sep = ", ") |>
  mutate(Imaging = fct_collapse(Imaging,
    "Intraoral 3D Scans or images" = c("Intra-oral 3D scans", "Intraoral photograph"))) |>
  # now the country
  pivot_longer(
    cols = starts_with("country"),
    names_to = "country_number",
    values_to = "country"
  ) |>
  # filter the full cells of country
  filter(!is.na(country)) |>
  select(-country_number)
```

```{r}
# df_short |>
#   group_by(Response.ID) |>
#   summarise(
#     Total_Images = sum(Images, na.rm = TRUE),
#     Total_Patients = sum(Patients, na.rm = TRUE),
#     Mean_FAIRness = mean(FAIRness, na.rm = TRUE),
#     Mean_FAIRness_Level = mean(FAIRness.level, na.rm = TRUE),
#     Mean_Findable = mean(Findable.max.7, na.rm = TRUE),
#     Mean_Accessible = mean(Accessible.max.3, na.rm = TRUE),
#     Mean_Interoperable = mean(Interoperable.max.4, na.rm = TRUE),
#     Mean_Reusable = mean(Reusable.max.10, na.rm = TRUE)
#   )
```

```{r}
df_short |>
  # group_by(Imaging, Response.ID) |>
  group_by(Imaging) |>
  summarise(n = n(), images = sum(Images)) |>
  knitr::kable()
```

```{r}
df_short |>
  # group_by(Imaging, Response.ID) |>
  group_by(Imaging) |>
  summarise(n = n()) |>
  knitr::kable()
```

```{r}
df |>
  select(Images...Panoramic:Images...CBCT, Response.ID) |>
  pivot_longer(-Response.ID) |>
  filter(!is.na(value)) |>
  group_by(name) |>
  summarise(n = n(), sum = sum(value), mean = mean(value), sd = sd(value)) |>
  mutate(across(where(is.numeric), round, 1)) |>
  knitr::kable()
```

```{r}
df_short |>
  tabyl(Imaging, country) |>
  knitr::kable()
```

```{r}
df_short |>
  select(Imaging, FAIRness, Findable.max.7:Reusable.max.10) |>
  pivot_longer(-Imaging) |>
  mutate(Imaging = fct_collapse(Imaging,
    "Other" = c("Cephalometric radiographs", "Intraoral 3D Scans or images"))) |>
  group_by(name, Imaging) |>
  summarise(# n = n(),
            # sum = sum(value),
            mean = mean(value)
            ) |>
  # sd = sd(value)) |>
  mutate(name = fct_relevel(name, "Findable.max.7", "Accessible.max.3", "Interoperable.max.4", "Reusable.max.10")) |>
  pivot_wider(names_from = name, values_from = mean) |>
  mutate(across(where(is.numeric), round, 1)) |>
  knitr::kable()
```