Comment: Why it is important to share data

Sina Bari, vice president of healthcare and life sciences AI at iMerit, writes that in medicine, the most cited word has always been ‘open’.

The Hippocratic Oath includes the instruction “to consider [my teacher’s] family as my own brothers, and to teach them this art, if they want to learn it, without fee or indenture”. This is reflected in medicine’s tradition of sharing knowledge across a global community to keep advancing our science.

At iMerit, we decided to apply that same logic to data. Earlier this year, in collaboration with Segmed and Advocate Health, we released the largest open-source annotated 3D mammography dataset to date: 558 biopsy-confirmed digital breast tomosynthesis (DBT) exams, fully de-identified, validated by US board-certified radiologists, and available for non-commercial research.

The response from the global research community was immediate and enthusiastic, reflecting the real and urgent need in the fight against breast cancer. Even with advances in care, the impact of breast cancer is profound: one in eight women will be diagnosed with breast cancer in their lifetime, and approximately 310,720 new invasive cases are expected in the US in 2024, according to the American Cancer Society. When detected early, five-year survival exceeds 99%.

Early detection is not just a clinical goal; it is a solvable engineering problem that AI is genuinely well-suited to help with. DBT has already demonstrated meaningful improvements over conventional 2D imaging in detecting invasive cancers, particularly in women with dense breast tissue.

The clinical evidence supporting DBT is robust and growing. The question is whether AI built on top of it will be equally trustworthy. That depends almost entirely on the quality of the data it learns from.


What makes this dataset unique

Not all annotated datasets are created equal. This release reflects the clinical and operational standards we apply across our medical AI programs: 558 female patients imaged using digital breast tomosynthesis, with biopsy-confirmed ground truth comprising 271 malignant (48.6%) and 287 benign (51.4%) cases – a deliberate balance that prevents models from learning shortcuts.

The average lesion size is 1.34 cm, with approximately 85% of findings under 2 cm, squarely in the early-detection range where AI assistance is most valuable.

Annotations were validated by US board-certified, MQSA-certified radiologists, using a multi-reader consensus workflow with formal adjudication where experts disagreed. The dataset is fully de-identified in compliance with both the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the General Data Protection Regulation (GDPR).

The breast cancer AI ecosystem advances when researchers at academic medical centres, at lean startups, and at established diagnostics companies have access to a shared, high-quality benchmark. Proprietary data silos produce proprietary blind spots. When every team is training on different data with different annotation standards, comparing results becomes nearly impossible, and progress slows.

Shared, high-quality datasets create a common benchmark. Model performance is the output, but what feeds it is how the data itself is defined, labelled, and validated. Over time, those benchmarks become the foundation for more consistent and reliable AI systems across the ecosystem.

What this enables

The dataset is already being used by academic researchers and healthcare technology teams globally. Our goal is straightforward: to support more robust AI models trained on clinically validated ground truth, enable faster iteration cycles for teams working on early-detection applications, and raise the shared standard for annotation quality across the medical imaging field. Ultimately, that means producing tools that genuinely support radiologists and reach more patients earlier.

The tradition in medicine has always been to publish, share and advance together. Earlier detection should depend not on who has access to the data, but on how many lives we can reach in time.
