BrainX AI

Ophthalmology

A Brazilian Multilabel Ophthalmological Dataset (BRSET)

This dataset consists of 16,266 color fundus retinal photos from 8,524 Brazilian patients. Alongside the images, it includes demographics; anatomical parameters of the macula, optic disc, and vessels; quality-control parameters (focus, illumination, image field, and artifacts); and multi-label annotations.

Know More

Cardiology/

General

Google Heart Rate Measurement Study Dataset (the “HR Dataset”)

The HR Dataset includes skin tone information, facial video recordings, and heart rate. Facial video recordings captured during laboratory studies include video of study participants’ full faces. Facial video recordings captured under “free-living” conditions consist of de-identified patches from study participants’ forehead regions (downsampled to 10×5 pixel patches) to preserve participant privacy. The dataset additionally includes a heart rate estimation model (“PHRM-mini”) trained on the dataset.

Know More

General

PMC-Patients

A novel dataset of patient summaries and relations called PMC-Patients to benchmark two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). Specifically, we extract patient summaries from PubMed Central articles using simple heuristics and utilize the PubMed citation graph to define patient-article relevance and patient-patient similarity. PMC-Patients contains 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations, which is the largest-scale resource for ReCDS and also one of the largest patient collections.

Know More

Surgery

GR00T-H: Surgical robot data

GR00T-H post-trains GR00T N1.6 on surgical robot data from multiple institutions and robot platforms simultaneously. The core challenge is that each institution records data differently — different robots, coordinate conventions, frame rates, camera setups, and state/action representations. GR00T-H solves this by defining per-embodiment modality configs that convert each dataset into a common representation (REL_XYZ_ROT6D for EEF poses) while preserving robot-specific details like clutch handling and motion scaling. The dataset contains 778 hours of real and synthetic procedure episodes.

Know More

Emergency Department

OpenPOCUS: ED Lung US dataset

An open-source dataset of lung POCUS images derived from a multi-center study involving 226 adult patients presenting to emergency departments with respiratory symptoms. Images were acquired using a standardized scanning protocol (12-zone or modified 8-zone) with various POCUS devices. Videos were preprocessed to remove identifiers, and frames were extracted and standardized to 512×512 pixels using letterboxing to maintain aspect ratios. The dataset contains 1,871 video clips comprising 324,027 frames. Half of the participants (50%) had COVID-19 pneumonia.
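The letterboxing step mentioned above can be illustrated with a minimal sketch. This is not the OpenPOCUS preprocessing code: the function name `letterbox`, the black padding value, and the nearest-neighbour resampling are all assumptions chosen to keep the example dependency-light. It only shows the general idea of scaling a frame to fit a 512×512 canvas and padding the remainder so the aspect ratio is preserved.

```python
import numpy as np


def letterbox(frame: np.ndarray, size: int = 512, pad_value: int = 0) -> np.ndarray:
    """Fit a frame into a size x size canvas while preserving aspect ratio.

    Hypothetical helper for illustration; works on (H, W) or (H, W, C) arrays.
    """
    h, w = frame.shape[:2]
    scale = size / max(h, w)  # the longer side is scaled to exactly `size`
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))

    # Nearest-neighbour resampling via index lookup (an assumption; a real
    # pipeline would more likely use an image library with interpolation).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = frame[rows][:, cols]

    # Centre the resized frame on a padded square canvas.
    canvas = np.full((size, size) + frame.shape[2:], pad_value, dtype=frame.dtype)
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```

For a 480×640 frame, this scales the content to 384×512 and adds 64 rows of padding above and below it.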

Know More

General/

LLM

EkaCare Medical Public Datasets

A collection of datasets curated by EkaCare for the development and evaluation of LLMs in the healthcare domain. A few are listed below:

– The Eka Medical ASR Evaluation Dataset enables comprehensive evaluation of automatic speech recognition systems designed to transcribe medical speech into accurate text—a fundamental component of any medical scribe system.

– The Eka Medical Records Parsing Dataset empowers evaluation of AI systems designed to extract structured information from unstructured medical documents, enabling true digitisation of healthcare data while maintaining clinical accuracy.

– The Eka Structured Clinical Note Generation Dataset facilitates evaluation of medical scribe systems capable of transforming transcribed medical conversations into structured, entity-level medical records. Comprising over 156 meticulously transcribed medical conversations between EkaCare’s internal doctors and team members serving as patients, this dataset captures diverse clinical interaction patterns.

Know More

Generative AI/

Surgery

MedVidBench

MedVidBench is a test benchmark for evaluating Video Large Language Models (VLMs) on medical and surgical video understanding. It covers 8 diverse tasks (with GPT-4 and Gemini variants for captioning tasks) across 8 surgical-video datasets. 6,245 test samples across 8 tasks (11 task variants). 8 surgical-video source datasets: AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, JIGSAWS, NurViD. 103,742 video frames (~18 GB) with per-sample FPS and temporal metadata. Bounding-box annotations for 306 region-caption samples.

Know More

Pathology

SafeICU is a freely available database comprising de-identified health-related data from over 2,500 pediatric patients admitted to the pediatric intensive care unit at AIIMS (All India Institute of Medical Sciences), New Delhi, between 2015 and 2025. The AIIMS hosts one of the country’s leading pediatric intensive care units, an independent 8-bed unit that delivers advanced critical care to children with life-threatening illnesses. The SafeICU database includes information such as demographics, bedside vital signs (recorded every 15 seconds), laboratory test results, medications, caregiver notes, microbiology, and mortality. The unit cares for a broad clinical spectrum, with the most frequent reasons for admission being severe respiratory illnesses (such as pneumonia and respiratory failure), sepsis and septic shock, hypertensive crises, congenital heart disease, and acute kidney injury.

Know More

General/

Generative AI/

LLM

OpenAI: HealthBench Professional Eval dataset

HealthBench Professional contains 525 physician-authored tasks spanning three clinician-facing use cases: care consult, writing and documentation, and medical research. Each example is designed to evaluate the next model response in a single-turn or multi-turn conversation between a clinician and a model, and is graded via example-specific criteria, similar to HealthBench. HealthBench Professional was built through a process of physician-authored annotations with extensive vetting and quality control. A total of 190 physicians contributed to the effort, with practice experience across 50 countries and 26 medical specialties.

Know More

Cardiology

EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs

This dataset contains a de-identified collection of 100,000 12-lead electrocardiograms (ECGs) with paired structural heart disease (SHD) labels derived from echocardiography, collected at Columbia University Irving Medical Center. Each ECG is provided with raw waveform data sampled at 250 Hz across all 12 leads, along with accompanying demographic and ECG-specific tabular metadata, including age, sex, heart rate, PR interval, QRS duration, and corrected QT interval. Each ECG is annotated with a binary label indicating the presence or absence of structural heart disease based on echocardiographic findings. This dataset was developed as part of the creation of the Columbia Mini-Model, a lightweight deep learning model for SHD detection from ECGs.

Know More

Psychiatry

2,966 publications were used for analysis and are made available after final exclusions. The dataset reveals a slight decline in classical Machine Learning (173) and a surge in Multimodal Foundation Models (144) when compared to the previous year in review. Imaging remains the dominant specialty. Data type distribution shows Image data at 53.9% and Text at 38.2%. This dataset tracks the specific trajectories of various specialties, such as Oncology and Surgery, as they adopt higher-capacity foundation models.

Know More

Imaging

RSNA Screening Mammography Breast Cancer Detection (RSNA-SMBC) Dataset

This dataset is highly enriched to contain 833 cancer examinations, 1188 biopsy-proven benign examinations, and 3897 examinations flagged as abnormal on screening but subsequently determined to be negative (BI-RADS 1) or benign (BI-RADS 2) at diagnostic imaging.

Related publication: Trivedi, H.M., et al., Open-Source Dataset for the RSNA Screening Mammography Cancer Detection Challenge. Radiology: Artificial Intelligence, 2026. 8(2): p. e250375.

Know More

Cardiology/

Generative AI/

LLM

Cardiac Investigations Text dataset

Data consists of clinical test text data (ECGs, CMRs, rest and stress TTEs, ambulatory Holter monitors, CPXs). This study explored the potential of Articulate Medical Intelligence Explorer (AMIE), a large language model-based experimental medical artificial intelligence system, to augment clinical decision-making in this challenging context. We conducted a randomized controlled trial comparing large language model-assisted care with the usual care of complex patients suspected of having a genetic cardiomyopathy, and we curated a real-world dataset of complex cases from a subspecialist cardiology practice.

Know More

Cardiology/

ArchEHR-QA is an expert-annotated dataset of 134 cases from intensive care unit and emergency department settings, built to evaluate the grounding capabilities of models for responding to patient-initiated queries. The dataset consists of patient-initiated questions posted in the public domain, the corresponding clinician-interpreted questions, the excerpts of the EHRs annotated at the sentence level with relevance to the question, and clinician-generated free-text answers to the questions grounded with EHR sentences. We collect true patient health information needs expressed in real-world health forum messages, then align the messages to publicly accessible real EHRs. Derived from: MIMIC-III Clinical Database v1.4, MIMIC-IV-Note: Deidentified free-text clinical notes v2.2

Know More

Imaging/

LLM

PARROT dataset: Radiology reports

PARROT is a collaborative initiative to create a multilingual open dataset of radiological reports on which to test LLMs. The aim of PARROT is to represent the diversity of languages and reporting styles to promote applicability of LLM-related research in non-English clinical settings. It contains 2,658 reports in 14 languages, from 76 authors in 21 countries.

Know More

Generative AI/

LLM

Medical-Reasoning-SFT

Medical SFT is a curated dataset designed to support the supervised fine-tuning of large language models (LLMs) for medical reasoning tasks. It comprises multi-turn dialogues, clinical case scenarios, and question-answer pairs that reflect the complex reasoning processes encountered in real-world clinical practice. The dataset is intended to help models develop key competencies such as differential diagnosis, evidence-based decision-making, patient communication, and guideline-informed treatment planning. II-Medical SFT is built using a combination of our custom synthetic data generation pipeline and publicly available medical reasoning datasets, ensuring both diversity and clinical relevance. The training dataset comprises 2,197,741 samples.

Know More

Generative AI/

LLM

Medical-Reasoning-SFT-GPT-OSS-120B

A high-quality dataset of synthetic medical reasoning conversations generated using OpenAI’s gpt-oss-120B model with reasoning effort set to high, designed for supervised fine-tuning of large language models in healthcare applications. Intelligent-Internet/II-Medical-Reasoning-SFT was used as the seed dataset. Dataset statistics: total samples: 200,927; total tokens: 539,165,577; user messages: 200,847; assistant messages: 200,847; average tokens per sample: 2,683.3; average user tokens per sample: 114.1; average assistant tokens per sample: 2,569.2. Each conversation demonstrates structured medical thinking with step-by-step reasoning processes.

Know More

Cancer/

GPTNERMED is a novel open synthesized dataset and neural named-entity-recognition (NER) model for German texts in medical natural language processing (NLP). This dataset contains the synthetic German sentences with annotated entities (Medikation, Dosis, Diagnose) from the GPTNERMED project. Neither the sentences nor the annotations have been manually validated by medical professionals, so this dataset is not a gold-standard dataset. The dataset consists of 9,845 sentences (121,027 tokens by the SpaCy tokenizer, 245,107 tokens by the GPT tokenizer).

Related publication: Frei J, Kramer F. Annotated dataset creation through large language models for non-english medical NLP. J Biomed Inform. 2023 Sep;145:104478.

Know More

Imaging

MedPix 2.0 dataset

MedPix 2.0 is derived from MedPix, a free open-access source. It offers a balanced variety of CT and MRI scans of different body parts. For each image, a complete structured clinical case is provided. MedPix® is a free open-access multimodal online database of medical images, teaching cases, and clinical topics, managed by the National Library of Medicine (NLM) of the National Institutes of Health (NIH). It mainly serves as a support system for Continuing Medical Education (CME) of physicians, nurses, and healthcare students. The database collects clinical cases related to more than 12,000 patients. Each case contains at least one medical image and the corresponding findings, discussion notes, diagnosis, differential diagnosis, treatment, and follow-up. Textual information is reported in a semi-structured format. Attached to each clinical case is a topic section, where the disease under investigation is discussed in detail from an academic and general perspective.

Know More

Ophthalmology

FFA-IR: Fundus Fluorescein Angiography Images and Reports dataset

Large-scale medical dataset. FFA-IR collects 766 reports along with 47,247 FFA images from clinical practice. Explainable annotation. FFA-IR annotates 46 categories of lesions with a total of 12,166 regions. Bilingual reports. FFA-IR provides both English and Chinese reports for each case.

Know More

Imaging

LLaVA-Rad dataset

239,025 additional X-ray image-text pairs from MIMIC-CXR using GPT-4, expanding the dataset to 400,042 pairs – more than doubling its original size.

FactEHR is a benchmark dataset designed to evaluate the ability of large language models (LLMs) to perform factual reasoning over clinical notes. It includes:

2,168 deidentified notes from multiple publicly available datasets
8,665 LLM-generated fact decompositions
987,266 entailment pairs evaluating precision and recall of facts
1,036 expert-annotated examples for evaluation

FactEHR supports LLM evaluation across tasks like information extraction, entailment classification, and model-as-a-judge reasoning.

Know More

General/

Generative AI

Mediflow: Synthetic clinical dataset

A large-scale synthetic instruction dataset of 2.5M rows (~700k unique instructions) for clinical natural language processing covering 14 task types and 98 fine-grained input clinical documents.

Know More

Imaging

PadChest-GR (Grounded-Reporting)

A public bilingual dataset of 4,555 CXR studies with substantiated reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual positive and negative findings in English and Spanish. In total, PadChest-GR contains 7,037 sentences of positive findings and 3,422 sentences of negative findings. This dataset is derived from PadChest dataset.

Know More

Imaging/

Pulmonary

OpenPOCUS (Lung ultrasound images)

One of the largest annotated lung ultrasound (LUS) repositories to date: over 300,000 de-identified frames, drawn from 226 patients with diverse respiratory diagnoses. Includes both COVID and non-COVID pneumonia, pulmonary edema, COPD, healthy controls, and more.

Know More

Endocrinology/

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics’ Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

118,385 emergency department visits with continuous physiological waveforms.

Multimodal Clinical Monitoring in the Emergency Department (MC-MED), a comprehensive, multimodal, and de-identified clinical and physiological dataset. MC-MED includes 118,385 adult ED visits to an academic medical center from 2020 to 2022. Data include continuously monitored vital signs, physiologic waveforms (electrocardiogram, photoplethysmogram, respiration), patient demographics, medical histories, orders, medication administrations, laboratory and imaging results, and visit outcomes. MC-MED is the first dataset to combine detailed physiologic monitoring with clinical events and outcomes for a large, diverse ED population.

Know More

Pathology

DiagSet: a dataset for prostate cancer histopathological image classification

The dataset consists of three different partitions: DiagSet-A, containing over 2.6 million tissue patches extracted from 430 fully annotated scans; DiagSet-B, containing 4675 scans with assigned binary diagnosis; and DiagSet-C, containing 46 scans with diagnosis given independently by a group of histopathologists.

Know More

General/

Generative AI/

LLM

MEDEC Dataset (MEDICAL ERROR DETECTION AND CORRECTION IN CLINICAL NOTES)

It includes 3,848 clinical texts from the MS and University of Washington hospital collections covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism).

Know More

Dermatology/

Imaging/

Ophthalmology/

Pathology

BiomedParseData: A foundation model dataset for joint segmentation, detection and recognition of biomedical objects

BiomedParseData was created by combining 45 biomedical image segmentation datasets and using GPT-4 to generate the canonical semantic label for each segmented object. GPT-4 was used to create a unifying biomedical object ontology for image analysis and harmonize natural language descriptions with this ontology. This ontology encompasses three main categories (histology, organ and abnormality), 15 meta-object types and 82 specific object types. The resulting BiomedParseData contains 3.4 million distinct image–mask–label triples, spanning nine imaging modalities and 25 anatomic sites, representing a large-scale and diverse dataset for semantic-based biomedical image analysis. The images include CT scans, MRI, chest X-rays, ultrasound, skin lesion photos, endoscopy images, pathology whole slide images, and eye OCTs.

Know More

General/

Generative AI/

LLM

UniTox: drug-induced toxicity dataset

A unified dataset of 2,418 FDA-approved drugs with drug-induced toxicity summaries and ratings created by using GPT-4o to process FDA drug labels. UniTox spans eight types of toxicity: cardiotoxicity, liver toxicity, renal toxicity, pulmonary toxicity, hematological toxicity, dermatological toxicity, ototoxicity, and infertility.

Know More

General/

Generative AI/

LLM

MedConceptsQA

An open-source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard.

Know More

General

SGIR clinical trials dataset

This information retrieval test collection contains:

204,855 publicly available clinical trials crawled from ClinicalTrials.gov.
60 topics of three types (patient case descriptions, patient case summaries, and assessor-provided ad hoc queries), with an average of 10.2 queries per topic.
4,000 assessor-provided relevance assessments for topic-trial pairs.

Know More

General

TREC Biomedical Tracks datasets

This site hosts the information for three of the five major medical track series that have run at the Text REtrieval Conference (TREC), with links to the other two major track series below. These tracks have sought to provide benchmark datasets and evaluate information retrieval systems focused on many of the most important information access problems in biomedicine.

TREC Genomics (2003-2007). This track focused on genomics researchers seeking relevant biomedical literature.
TREC Medical Records (2011-2012). This track focused on retrieving cohorts of patients from electronic health records (EHRs).
Clinical Decision Support (2014, 2015, 2016). This track focused on clinicians looking for evidence-based full-text literature to support diagnosis, treatment, and testing decisions.
Precision Medicine (2017, 2018, 2019, 2020). This track focused on oncologists looking for evidence-based treatment literature and clinical trials.
Clinical Trials (2021, 2022, 2023). This (ongoing) track focuses on matching patients to relevant clinical trials.

Know More

Imaging

BraTS-Africa (Brain Tumor Segmentation MRI dataset)

The dataset is a collection of retrospective pre-operative brain magnetic resonance imaging (MRI) scans, clinically acquired from six diagnostic centers in Nigeria. The scans are from 146 patients who have brain MRIs indicating central nervous system neoplasms, diffuse glioma, low-grade glioma, or glioblastoma/high-grade glioma. The brain scans were multiparametric MR images (mpMRI), specifically T1, T1 CE, T2, and T2 FLAIR, acquired on 1.5T MRI between January 2010 and December 2022. The expert-annotated tumor sub-regions for each of the 146 cases are provided along with a metadata CSV file covering study location and scanner type, where available.

Know More

General/

Generative AI/

LLM

MTS-Dialog dataset: A collection of 1.7k short doctor-patient conversations and corresponding summaries.

The MTS-Dialog dataset is a new collection of 1.7k short doctor-patient conversations and corresponding summaries (section headers and contents). The training set consists of 1,201 pairs of conversations and associated summaries. The validation set consists of 100 pairs of conversations and their summaries. MTS-Dialog includes 2 test sets; each test set consists of 200 conversations and associated section headers and contents.

The augmented dataset consists of 3.6k pairs of medical conversations and associated summaries created from the original 1.2k training pairs via back-translation using two languages, French and Spanish, as described in the paper.

Know More

General/

Generative AI/

LLM

Aci-bench: Ambient Clinical Intelligence Dataset

The corpus, created by domain experts, is designed to model three variations of model-assisted clinical note generation from doctor-patient conversations. These include conversations with (a) calls to a virtual assistant (e.g. required use of wake words or prefabricated, canned phrases), (b) unconstrained directions or discussions with a scribe, and (c) natural conversations between a doctor and patient. Contains 1,342 samples.

Know More

General/

Generative AI/

LLM

The UltraMedical Collections

The UltraMedical Collections is a large-scale, high-quality dataset of biomedical instructions, comprising 410,000 synthetic and manually curated samples.

Know More

Imaging

The largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performances of several common state-of-the-art approaches. A corpus of textual data corresponding to synthetic clinical encounters, including each encounter’s dialogue transcript and clinical notes.

Know More

General/

Generative AI/

LLM

CTO Dataset: A Clinical Trial Outcome Benchmark

The largest trial outcome dataset with around 479K clinical trials, aggregating outcomes from multiple sources of weakly supervised labels, minimizing the noise from individual sources, and eliminating the need for human annotation.

Know More

General

World Health Organization Data Collections

The World Health Organization manages and maintains a wide range of data collections related to global health and well-being as mandated by our Member States.

Know More

Imaging

CheXpert Plus dataset

The CheXpert Plus dataset is a comprehensive collection that brings together text and images in the medical field, featuring a total of 223,462 unique pairs of radiology reports and chest X-rays across 187,711 studies from 64,725 patients. The X-rays are provided in DICOM format, including 47 DICOM metadata elements to support detailed analysis. Accompanying these images are 187,711 radiology reports, each divided into 11 subsections. Finally, the dataset is enriched with annotations for 14 different chest pathologies across the studies, alongside 8 metadata elements concerning patient information.

Know More

Imaging

Radiology Report Generation Models Evaluation Dataset For Chest X-rays (RadEvalX)

RadEvalX focuses on radiologist evaluations of errors found in automatically generated radiology reports. The dataset includes annotations from two board-certified radiologists, who identified clinically significant and clinically insignificant errors across eight different categories of errors. A balanced dataset of 100 reports for human annotation from an initial set of 590 reports generated using M2Tr

Know More

LLM/

Pathology

Prov-Path Sample Data

2 Sample datasets for histopathology foundation model Prov-GigaPath.

Sample 1: https://zenodo.org/records/10909616

Sample 2: https://zenodo.org/records/10909922

Know More

Genetics

DECIPHER database

DECIPHER is used by the clinical community to share and compare phenotypic and genotypic data. The DECIPHER database contains data from 48,774 patients who have given consent for broad data-sharing.

Know More

Genetics

The MultiCaRe Dataset: Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central

A multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions.

Github: https://github.com/mauro-nievoff/MultiCaRe_Dataset/tree/main

Know More

Imaging/

LLM/

Pulmonary

INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis

INSPECT contains data from 19,438 patients, including CT images, sections of radiology reports, and structured electronic health record (EHR) data (including demographics, diagnoses, procedures, and vitals). Using this dataset, the authors at Stanford University develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks. INSPECT is the largest multimodal dataset for enabling reproducible research on strategies for integrating 3D medical imaging and EHR data.

Know More

General/

Pulmonary

FluSense-data

FluSense platform collected and analyzed more than 350,000 waiting room thermal images and 21 million non-speech audio samples from the hospital waiting areas.

Know More

General/

LLM

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

MedAlign is a clinician-generated dataset for instruction following with electronic medical records. The MedAlign dataset contains:

1314 clinician-generated instructions, 983 after removing duplicates using ROUGE-L overlap;
276 longitudinal EHRs;
303 clinician-generated responses to instruction-EHR pairs.
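The ROUGE-L deduplication mentioned in the list above can be sketched roughly as follows. This is an illustration only, not the MedAlign authors' procedure: the whitespace tokenization, the greedy keep-first strategy, and the 0.7 threshold are all assumptions, and the function names are hypothetical. ROUGE-L here is the F-measure based on the longest common subsequence (LCS) of tokens.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (a simplifying assumption)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def deduplicate(instructions: list[str], threshold: float = 0.7) -> list[str]:
    """Greedily keep an instruction only if it is not too similar to any kept one.

    The threshold value is hypothetical, not taken from the MedAlign paper.
    """
    kept: list[str] = []
    for text in instructions:
        if all(rouge_l_f1(text, k) < threshold for k in kept):
            kept.append(text)
    return kept
```

With this sketch, two instructions that differ by a single trailing word score well above the assumed threshold, so only the first is kept, while an unrelated instruction is retained.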

Know More

General/

LLM

MeQSum corpus of 1,000 summarized consumer health questions. In particular, authors show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%.

Related publication: Asma Ben Abacha and Dina Demner-Fushman. 2019. On the Summarization of Consumer Health Questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2228–2234, Florence, Italy. Association for Computational Linguistics.

DATASUS provides information that can support objective analyses of the health situation, evidence-based decision making, and the development of health action programs. Measuring the population’s health status is a tradition in public health. It began with the systematic recording of mortality and survival data (Vital Statistics – Mortality and Live Births). With advances in the control of infectious diseases (Epidemiological and Morbidity information) and with a better understanding of the concept of health and its population determinants, the analysis of the health situation began to incorporate other dimensions of health status.

Data on morbidity, disability, access to services, quality of care, living conditions and environmental factors have become metrics used in the construction of Health Indicators, which translate into relevant information for the quantification and evaluation of health information.

This section also contains information on the population’s Health Care, registrations (Care Network), hospital and outpatient networks, registration of health establishments, as well as information on financial resources and Demographic and Socioeconomic information.

Know More

General

All of Us Research Hub

The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit. Most All of Us participants contribute biosamples such as blood and/or saliva. DNA from these samples is extracted and sent to genome centers for genomic analysis, including whole genome sequencing (WGS) and genome-wide genotyping. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model.

740,000+ participants, 400,000+ electronic health records, 520,000+ biosamples.

Know More

General/

Genetics/

Imaging

UK Biobank

UK Biobank has collected and continues to collect extensive environmental, lifestyle, and genetic data on half a million participants. It includes data for:

Imaging: Brain, heart and full body MR imaging, plus full body DEXA scan of the bones and joints and an ultrasound of the carotid arteries.
Genetics: Whole genome sequencing for all 500,000 participants, whole exome sequencing for 470,000 participants, genotyping (800,000 genome-wide variants and imputation to 90 million variants).
Health linkages: Linkage to a wide range of electronic health-related records, including death, cancer, hospital admissions and primary care records.
Biomarkers: Data on more than 30 key biochemistry markers from all participants, taken from samples collected at recruitment and the first repeat assessment.
Activity monitor: Physical activity data over a 7-day period collected via a wrist-worn activity monitor for 100,000 participants plus a seasonal follow-up on a subset.
Online questionnaires: Data on a range of exposures and health outcomes that are difficult to assess via routine health records, including diet, food preferences, work history, pain, cognitive function, digestive health and mental health.
Repeat baseline assessments: A full baseline assessment is undertaken during the imaging assessment of 100,000 participants.
Samples: Blood and urine were collected from all participants, and saliva from 100,000.

Know More

Pulmonary

ICBHI 2017 Challenge: Respiratory Sound Database

The Respiratory Sound Database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics – ICBHI 2017. The Respiratory Sound Database contains audio samples, collected independently by two research teams in two different countries, over several years. The database consists of a total of 5.5 hours of recordings containing 6898 respiratory cycles, of which 1864 contain crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, in 920 annotated audio samples from 126 subjects.

Know More

Imaging

VQA-RAD: Visual Question Answering (VQA) for radiology images

A manually constructed dataset where clinicians asked naturally occurring questions about radiology images and provided reference answers. Manual categorization of images and questions provides insight into clinically relevant tasks and the natural language to phrase them. The dataset contains 104 head axial single-slice CTs or MRIs, 107 chest x-rays, and 104 abdominal axial CTs. The final VQA-RAD dataset contains 3,515 total visual questions. Of these, 1,515 (43.1%) are free-form.

Know More

Imaging

SinoCT: Head CT dataset

This dataset contains over 9,000 head CT scans, each labeled as normal or abnormal. Each scan contains a reconstructed image (stored in our institution’s PACS and saved as DICOMs) and a corresponding sinogram (simulated via GE’s CatSim software and saved as NumPy arrays). The reconstructed images are 512×512 pixels with a variable number of axial slices per scan. The sinograms are 984×888 pixels with a variable number of axial slices per scan. The full dataset is 1.3 TB.

Know More

LLM

MedInstruct-52k

A diverse medical task dataset comprising 52,000 instruction-response pairs, and MedInstruct-test, a set of clinician-crafted novel medical tasks, to facilitate the building and evaluation of future domain-specific instruction-following models.

The Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD), originally launched in 2014 as the Exome Aggregation Consortium (ExAC), is the result of a coalition of investigators willing to share aggregate exome and genome sequencing data from a variety of large-scale sequencing projects, and make summary data available for the wider scientific community.

v4 release is composed of 730,947 exomes and 76,215 genomes (GRCh38)
gnomAD v4 structural variants (SV) represent 63,046 genomes (GRCh38)
gnomAD v4 copy number variants (CNV) represent variants in less than 1% of 464,297 exomes (GRCh38)

Know More

General

CodiEsp corpus: gold standard Spanish clinical cases coded in ICD10 (CIE10)

The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish, and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS). The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The train set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each.
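The corpus is already divided into its three subsets, so the following Python sketch is only an illustration of what a random 500/250/250 split looks like. The document identifiers, the seed, and the `split_corpus` helper are hypothetical and are not part of the corpus or its tooling.

```python
import random


def split_corpus(doc_ids, sizes=(500, 250, 250), seed=0):
    """Randomly partition document ids into train, development, and test sets."""
    if sum(sizes) != len(doc_ids):
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_end = sizes[0]
    dev_end = train_end + sizes[1]
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


# Hypothetical identifiers standing in for the 1,000 clinical cases.
doc_ids = [f"case_{i:04d}" for i in range(1000)]
train, dev, test = split_corpus(doc_ids)
print(len(train), len(dev), len(test))
```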

Know More

Gastroenterology

The Kvasir Datasets (Endoscopy)

Three key datasets for endoscopy:

1. Kvasir-dataset-v2 contains 8,000 upper and lower endoscopy images in 8 classes, 1,000 images per class.

2. The Kvasir-Capsule dataset

3. Kvasir SEG dataset

Know More

Gastroenterology

The Nerthus Dataset: Evaluate the quality of bowel preparation for colonoscopy (video dataset)

It contains 21 videos with a total of 5,525 frames annotated and verified by medical doctors (experienced endoscopists). The videos are divided into four classes of predefined bowel-preparation qualities.

Know More

Gastroenterology

The SEE-AI Project Dataset (Small Bowel Endoscopy Images)

This dataset comprises 18,481 images extracted from 523 small bowel capsule endoscopy videos. Of these, 12,320 images are annotated with 23,033 disease lesions, and the remaining 6,161 are normal mucosa images. The annotations are provided in YOLO format.
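For readers unfamiliar with YOLO-format annotations, the sketch below converts one label line into a pixel bounding box. It assumes the common YOLO text layout, where each line holds a class id followed by the box center and size, all normalized to the range 0 to 1. The sample label, the 512×512 image size, and the `yolo_to_pixel_box` helper are hypothetical and are not taken from the SEE-AI files.

```python
def yolo_to_pixel_box(line, image_width, image_height):
    """Convert 'class x_center y_center width height' (normalized) to pixel corners."""
    class_id, x_center, y_center, width, height = line.split()
    x_center, y_center, width, height = (
        float(v) for v in (x_center, y_center, width, height)
    )
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return int(class_id), (x_min, y_min, x_max, y_max)


# Hypothetical label line and image size, for illustration only.
class_id, box = yolo_to_pixel_box("3 0.5 0.5 0.25 0.5", 512, 512)
print(class_id, box)
```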

Know More

General

AWS (Amazon) Marketplace Datasets

More than 80 open source healthcare datasets available through the AWS Open Data Sponsorship Program.

Know More

General

NHS-LLM and OpenGPT datasets

3 datasets:

NHS UK Q/A, 24,665 question and answer pairs, Prompt used: f53cf99826, Generated via OpenGPT using data available on the NHS UK Website. Download here. (Click on view Raw data)
NHS UK Conversations, 2,354 unique conversations, Prompt used: f4df95ec69, Generated via OpenGPT using data available on the NHS UK Website. Download here. (Click on view Raw data)
Medical Task/Solution, 4,688 pairs generated via OpenGPT using GPT-4, prompt used: 5755564c19. Download here. (Click on view Raw data)

Know More

Neurology

AMP®-Parkinson’s Disease Progression Prediction

Data to predict the course of Parkinson’s disease (PD) using protein abundance data. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity. This is a time-series code dataset with Kaggle’s time-series API.

Know More

Neurology

Parkinson’s Freezing of Gait Prediction datasets

The data series include three datasets, collected under distinct circumstances:

The tDCS FOG (tdcsfog) dataset, comprising data series collected in the lab, as subjects completed a FOG-provoking protocol.
The DeFOG (defog) dataset, comprising data series collected in the subject’s home, as subjects completed a FOG-provoking protocol.
The Daily Living (daily) dataset, comprising one week of continuous 24/7 recordings from sixty-five subjects. Forty-five subjects exhibit FOG symptoms and also have series in the defog dataset, while the other twenty subjects do not exhibit FOG symptoms and do not have series elsewhere in the data.

Know More

Cardiology

AHA Precision Medicine Platform

The Precision Medicine Platform is the only research interface with access to The American Heart Association’s Get With The Guidelines registry data.

2,600+ Hospitals (50% of all US Hospitals)

20+ Years of data collection

13,000,000+ National patient records

90% of stroke discharges

22% of cardiovascular discharges

Know More

General/

Genetics

All of Us Research database

The National Institutes of Health’s All of Us Research Program is building one of the largest biomedical data resources of its kind.

600,000+ participants

350,000+ EHR records

450,000+ biomedical specimen data

Know More

Cancer/

Imaging

NYUMets datasets

3 metastatic cancer datasets available through AWS API.

Time Series Dataset – Each row in the time series dataset represents a point in time, in units of days indexed from each patient’s initial gamma knife radiosurgery. Dataset variables include clinical details related to medication changes, imaging timing/references to raw imaging files, procedure timing, clinical follow up, and outcomes.
Individual Dataset – Each row represents an individual patient with demographic details and summary clinical data.
Gamma Knife Details Dataset – Each row represents an individual gamma knife target to provide further details about available gamma knife radiosurgery.

Know More

Dermatology

Dermofit Image Library

The Dermofit Image Library is a collection of 1,300 focal high quality skin lesion images collected under standardised conditions with internal colour standards. The lesions span across ten different classes including melanomas, seborrhoeic keratosis and basal cell carcinomas. Each image has a gold standard diagnosis based on expert opinion (including dermatologists and dermatopathologists). Images consist of a snapshot of the lesion surrounded by some normal skin. The Dermofit Image Library is available under an academic licence. There is a one-off £75 licence fee associated with this product.

Know More

Imaging

VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations

A dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Of this raw data, 18,000 images were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases.

Know More

Cardiology/

Pediatrics

EchoNet-Pediatric

The EchoNet-Peds database includes 7,643 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes. The database includes patients ranging from 0-18 years (43% female) with a wide range of sizes.

Know More

Open-i provides access to over 3.7 million images from about 1.2 million PubMed Central® articles; 7,470 chest x-rays with 3,955 radiology reports; 67,517 images from NLM History of Medicine collection; and 2,064 orthopedic illustrations.

Know More

Imaging

Brain tissue segmentation MRI dataset

A synthetic dataset of brain images simulated across 42 different MR protocols and based on 500 different reference brains from the Human Connectome Project (HCP) (Van Essen et al., 2012), leading to 21,000 simulated brain images.

Know More

Imaging

The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset

An open-source data collection consisting of a total of 955 T1-weighted MRIs (Magnetic Resonance Imaging) with manually segmented diverse lesions and metadata.

Related publication: Liew, Sook-Lei. The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset – Release 2.0, 2021. Inter-university Consortium for Political and Social Research [distributor], 2022-08-08. https://doi.org/10.3886/ICPSR36684.v5

Know More

Cancer/

Imaging

Breast Cancer MRI Dataset: Duke

The dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, having the following data components:

Demographic, clinical, pathology, treatment, outcomes, and genomic data: Collected from a variety of sources including clinical notes, radiology report, and pathology reports.
Pre-operative dynamic contrast enhanced (DCE)-MRI: Downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release in DICOM format.
Locations of lesions in DCE-MRI: Annotations on the DCE-MRI images by radiologists.
Imaging features from DCE-MRI: A set of 529 computer-extracted imaging features by inhouse software.

Know More

A dataset consisting of 53,449 audio samples (over 552 hours in total) crowd-sourced from 36,116 participants through our COVID-19 Sounds app. It also provides participants’ self-reported COVID-19 testing status with 2,106 samples tested positive.

Know More

Imaging

RadGraph: Extracting Clinical Entities and Relations from Radiology Reports

This dataset contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Additionally, there is an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs.

Related publication: Jain, S., Agrawal, A., Saporta, A., Truong, S. Q., Nguyen Duong, D., Bui, T., Chambon, P., Lungren, M., Ng, A., Langlotz, C., & Rajpurkar, P. (2021). RadGraph: Extracting Clinical Entities and Relations from Radiology Reports (version 1.0.0). PhysioNet. https://doi.org/10.13026/hm87-5p47.

Know More

General

Papers with code medical datasets

200+ datasets of various types with links and papers. Includes search options for data types, language and more.

Know More

Dermatology

The Medical Information Mart for Intensive Care (MIMIC)-IV database provided critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC).

Related publication: Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2020). MIMIC-IV (version 0.4). PhysioNet. https://doi.org/10.13026/a3wn-hq05.

Know More

General/

Neurology/

Ophthalmology

EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction

A dataset of paired Electroencephalography (EEG) and video-infrared eye tracking (ET) recordings from 356 subjects for more than 47 hours in total. A benchmark consisting of 3 evaluation tasks with increasing difficulty is introduced alongside the dataset.

Know More

Anesthesiology/

General/

Neurology

Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management

MONAI: Medical Open Network for Artificial Intelligence

The MONAI framework is the open-source foundation being created by Project MONAI. MONAI is a freely available, community-supported, PyTorch-based framework for deep learning in healthcare imaging. Project MONAI also includes MONAI Label, an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets, and build AI models in a standardized MONAI paradigm.

Know More

Imaging

UPENN-GBM: MRI scans for Glioblastoma (GBM) patients

This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format.

630 patients, 3,301 studies, 820,000+ images.

Know More

General/

Imaging/

Pathology/

Surgery

The EchoNet-LVH dataset includes 12,000 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac chamber size and wall thickness.

Know More

Imaging

Japanese Society of Radiological Technology (JSRT) database

The database includes 154 conventional chest radiographs with a lung nodule (100 malignant and 54 benign nodules) and 93 radiographs without a nodule. The database also includes additional information such as patient age, gender, diagnosis (malignant or benign), X and Y coordinates of the nodule, and a simple diagram of the nodule location. Lung nodule images were classified into five groups according to the degree of subtlety.

Know More

Anesthesiology

Behavioral and autonomic dynamics during propofol-induced unconsciousness dataset

Data was collected from nine healthy volunteers during a study of propofol-induced unconsciousness. For all subjects, approximately 3 hours of data were recorded under a target-controlled infusion protocol. Data includes continuous electrocardiogram (ECG); interventions included in the study for patient safety, such as administering phenylephrine (a vasopressor); heart rate variability (HRV); and electrodermal activity (EDA).

Related publication: Subramanian, S., Purdon, P., Barbieri, R., & Brown, E. (2021). Behavioral and autonomic dynamics during propofol-induced unconsciousness (version 1.0). PhysioNet. https://doi.org/10.13026/2rbc-1r03.

Know More

Ophthalmology

A global review of publicly available datasets for ophthalmological imaging

94 open access ophthalmological imaging datasets containing 507,724 images and 125 videos from 122,364 patients.

Know More

Cardiology

PTB-XL: EKG dataset

The PTB-XL ECG dataset is a large dataset of 21,837 clinical 12-lead ECGs of 10-second length from 18,885 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. In total, 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements.

Know More

Cardiology/

Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (M&Ms) Dataset

375 heterogeneous cardiac magnetic resonance (CMR) datasets acquired by using four different scanner vendors in six hospitals and three different countries (Spain, Canada and Germany).

Know More

Cancer/

Genetics/

Imaging

The Cancer Imaging Archive (TCIA) dataset collection

TCIA data are organized as “collections”; typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.) or research focus. Supporting data related to the images such as patient outcomes, treatment details, genomics and image analyses are also provided when available. More than 100 datasets, many of which are public.

Know More

General

n2c2 NLP Research Data Sets

Unstructured notes from the Research Patient Data Registry at Partners Healthcare, Boston, USA (originally developed during the i2b2 project). Clinical Natural Language Processing (NLP) data sets were originally created at a former NIH-funded National Center for Biomedical Computing (NCBC) known as i2b2: Informatics for Integrating Biology and the Bedside. Beginning in 2018, they are officially known as n2c2 (National NLP Clinical Challenges).

Know More

The dataset includes over 69,000 dermatology images. The International Skin Imaging Collaboration (ISIC) is a global partnership that has organized the world’s largest repository of publicly available dermoscopic images and hosted the first public benchmarks for melanoma detection in dermoscopic images, titled “Skin Lesion Analysis Towards Melanoma Detection”, at the IEEE International Symposium on Biomedical Imaging (ISBI).

Know More

Imaging

CQ500 dataset

A dataset of 491 Head CT scans with 193,317 slices, anonymized dicoms for all the scans and the corresponding radiologists’ reads done by three radiologists with an experience of 8, 12 and 20 years in cranial CT interpretation respectively.

Know More

Critical Care/

Imaging

COVID-Net

Publicly available suite of tailored deep neural network models for tackling different challenges ranging from screening to risk stratification to treatment planning for patients with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

Chest x-rays: 16,352 CXR images across 14,979 patients. Click here
Chest CT: 201,103 CT slices from 4,501 patients. Click here
Chest point-of-care ultrasound: 29,651 POCUS images. Click here
COVID-Net ICU: 1,925 records from 385 patients. Click here

The project has also expanded to the open-source TB-Net initiative for tuberculosis screening, the Fibrosis-Net initiative for pulmonary fibrosis progression prediction, and the Cancer-Net initiative for cancer screening.

Know More

Emergency Department

MIMIC-IV-ED

MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2016. It covers 448,972 ED stays, with vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses available.

Know More

Imaging

Chest X-ray dataset with eye tracking

Chest X-ray dataset with eye tracking and report dictation. Built on the MIMIC Chest X-ray dataset. 1,083 CXR images.

Related publication:

Karargyris, A., Kashyap, S., Lourentzou, I. et al. Creation and validation of a chest X-ray dataset with eye-tracking and report dictation for AI development. Sci Data 8, 92 (2021).

Know More

Imaging

RICORD: RSNA International COVID-19 Open Annotated Radiology Database

This database is the first multi-institutional, multi-national expert annotated COVID-19 imaging dataset. Annotated by three radiologists with majority vote adjudication by board certified radiologists, RICORD consists of 240 thoracic CT scans and 1,000 chest radiographs contributed from four international sites.

Know More

Anesthesiology

VitalDB dataset

A comprehensive dataset of 6,388 surgical patients composed of intraoperative biosignals and clinical information from the Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul, Korea.

Know More

Pathology

NuCLS

The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from The Cancer Genome Atlas (TCGA). These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students.

Know More

Imaging

CheXpert

CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients from Stanford Hospital.

This dataset has daily level information on the number of affected cases, deaths and recovery from 2019 novel coronavirus.

The data is available from 22 Jan 2020 onward.

Know More

Imaging

The RSNA 2019 Brain CT Hemorrhage Dataset.

Largest collection of intracranial hemorrhage CT scans. 874,035 images with expert annotations.

Reference: Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge

Know More

Cardiology/

General/

Neurology

PHYSIONET (MIMIC/eICU Collaborative)

One of the most comprehensive sources of healthcare datasets, primarily from ICU patients.

https://physionet.org/about/database/

MIMIC – IV Dataset (https://physionet.org/content/mimiciv/0.4/)

Includes:

Clinical datasets such as MIMIC, eICU Collaborative and Pediatric ICU datasets.
Waveform datasets with ECG, EEG, and arterial blood pressure waveforms.
ECG datasets with various pathophysiologic changes and drug interactions.
Fetal datasets including sounds and ECG.
Gait and Balance datasets, including gait dynamics for patients with various neurodegenerative disorders.
Neuro and Myoelectric datasets with EEG, EMG and evoked potential waveforms.
Image datasets with chest X-rays and MRI images.
Computed Tomography Images for Intracranial Hemorrhage Detection and Segmentation.
Miscellaneous datasets with text, language, posture and other data.

Know More

Imaging/

Neurology

ADNI Database

Imaging (MRI), clinical, genomic, and biomarker data from Alzheimer’s disease patients, for the purposes of scientific investigation, teaching, or planning clinical research studies.

Centers for Disease Control and Prevention (CDC) Datasets

The CDC’s datasets. Useful for incidence and prevalence of various disorders, and for mortality data from across the US.

Know More

General

Healthcare Cost and Utilization Project (HCUP) datasets

Agency for Healthcare Research and Quality’s HCUP datasets used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.

Know More

General

A place to store, share or find data. A platform for biomedical research.

Know More

General

Nature

Detailed data repositories for biomedical research, especially proteins and genetics.

Know More