
General/

LLM

Artificial Intelligence in Healthcare: 2023 Year in Review Dataset

The final analysis included 1,226 “mature” publications from 2023 related to AI in healthcare. Among these, the highest number of articles originated from the Imaging specialty (483), followed by Gastroenterology (86) and Ophthalmology (78). Analysis of data types revealed that image data was predominant, used in 75.2% of publications, followed by tabular data (12.9%) and text data (11.6%). Deep Learning models were extensively employed, constituting 59.8% of the models used.

For the LLM-related publications, after exclusions, 584 publications were classified into the 26 healthcare specialties and used for further analysis. The utilization of Large Language Models (LLMs) is highest in general healthcare specialties, at 20.1%, followed by surgery at 8.5%.

 

 

Related publication: Artificial Intelligence in Healthcare: 2023 Year in Review. Raghav Awasthi, Shreya Mishra, Piyush Mathur, et al.

General

Health Artificial Intelligence (HAI) dataset

A collection of 96,332 HAI documents (publications: 75,820; open research datasets: 638; patents: 11,226; grants: 6,113; clinical trials: 2,535) from 2009 to 2021. On average, 75.12% of the documents were tagged with at least one label related to either health problems or AI technologies (with 92.9% of publications tagged).

 

Related publication: Xuanyu Shi, Daoxin Yin, Dongliang Cui, Jian Du, et al. A Bibliographic Dataset of Health Artificial Intelligence Research. Health Data Sci. 2024;4:0125.

General/

LLM

EquityMedQA dataset for evaluating harm and biases in LLMs

A collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both the human assessment framework and the dataset design process are grounded in an iterative participatory approach and a review of possible biases in Med-PaLM 2 answers to adversarial queries.

 

 

Related publication: A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models. Stephen R. Pfohl, Heather Cole-Lewis, Ivor Horn, Karan Singhal, et al. arXiv:2403.12025v1 [cs.CY]

General/

Imaging

AI CAD Database: FDA approved devices image interpretation dataset

A curated database of FDA-cleared AI devices for medical image interpretation, a canonical task among the first to be clinically operationalized. It covers 140 FDA clearances from January 2016 to October 2023 for 104 unique AI-enabled CAD products, with some products having multiple clearances over time. It focuses specifically on AI devices with use cases that are historically referred to as variations of “CAD”, a term that stems from computer-aided detection.

 

Related publication: McNamara, S.L., Yi, P.H. & Lotter, W. The clinician-AI interface: intended use and explainability in FDA-cleared AI devices for medical image interpretation. npj Digit. Med. 7, 80 (2024). https://doi.org/10.1038/s41746-024-01080-1

General/

Pulmonary

Coswara dataset: COVID 19 patient audio recordings

A dataset containing a diverse set of respiratory sounds and rich metadata, recorded between April 2020 and February 2022 from 2,635 individuals (1,819 SARS-CoV-2 negative, 674 positive, and 142 recovered subjects). The respiratory sounds cover nine sound categories associated with variants of breathing, cough and speech. The metadata contains demographic information (age, gender and geographic location) as well as health information relating to symptoms, pre-existing respiratory ailments, comorbidity and SARS-CoV-2 test status.

 

Related publication: Bhattacharya, D., Sharma, N.K., Dutta, D. et al. Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection. Sci Data 10, 397 (2023). https://doi.org/10.1038/s41597-023-02266-0

Dermatology

Dermatology DDx dataset

A dermatology image dataset of 1,947 cases with annotated differential diagnoses (DDx) from multiple dermatologists across 419 conditions, associated risk categories for each condition, and softmax predictions from 4 different models.

 

Related publications: 

[1] Stutz, D., et al. (2023). [Conformal prediction under ambiguous ground truth](https://openreview.net/forum?id=CAd6V2qXxc). TMLR.
[2] Stutz, D., et al. (2023). [Evaluating AI systems under uncertain ground truth: a case study in dermatology](https://arxiv.org/abs/2307.02191). ArXiv, abs/2307.02191.

Imaging

AbdomenAtlas-8K

The largest multi-organ dataset (by far) with the spleen, liver, kidneys, stomach, gallbladder, pancreas, aorta, and IVC annotated in 8,448 CT volumes, equating to 3.2 million slices.

 

Related publication: AbdomenAtlas-8K: Annotating 8,000 CT Volumes for Multi-Organ Segmentation in Three Weeks. NeurIPS 2023. Chongyu Qu, Tiezheng Zhang, Hualin Qiao, Jie Liu, Yucheng Tang, Alan Yuille, Zongwei Zhou. arXiv:2305.09666 [eess.IV]

General/

LLM

Red Teaming Large Language Models in Medicine

The dataset contains 382 unique prompts and 1,146 total responses across three iterations of ChatGPT (GPT-3.5, GPT-4.0, and GPT-4.0 with Internet). 19.8% of the responses were labeled as inappropriate; GPT-3.5 accounted for the highest percentage at 25.7%, while GPT-4.0 and GPT-4.0 with Internet performed comparably at 16.2% and 17.5% respectively. 11.8% of responses were deemed appropriate with GPT-3.5 but inappropriate in updated models, highlighting the ongoing need to evaluate evolving LLMs.

 

 

Related publication: Chang CT, Farah H, Gui H, et al. Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior. medRxiv. 2024:2024.04.05.24305411.

 

Cancer/

LLM

CORAL: expert-Curated medical Oncology Reports to Advance Language model inference

A fine-grained, expert-labeled dataset of 40 de-identified breast and pancreatic cancer progress notes from the University of California, San Francisco. The authors assessed three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) in zero-shot extraction of detailed oncological information from two narrative sections of clinical progress notes.

 

 

Related publication: Sushil, M., Kennedy, V., Mandair, D., Miao, B., Zack, T., & Butte, A. (2024). CORAL: expert-Curated medical Oncology Reports to Advance Language model inference (version 1.0). PhysioNet. https://doi.org/10.13026/v69y-xa45.

Imaging

Medical Segmentation Decathlon

2,633 three-dimensional images collected across multiple anatomies of interest, multiple modalities, and multiple sources representative of real-world clinical applications. It comprises 10 datasets, including CT scans of the abdomen and lung and MRI of the brain and prostate.

 

Related publications:

 

LLM

Llama2-MedTuned-Instructions

Llama2-MedTuned-Instructions is an instruction-based dataset developed for training language models in biomedical NLP tasks. It consists of approximately 200,000 samples, each tailored to guide models in performing specific tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Medical Natural Language Inference (NLI). This dataset represents a fusion of various existing data sources, reformatted to facilitate instruction-based learning.

 

Related publication: Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing. Omid Rohanian, Mohammadmahdi Nouriborji, David A. Clifton. arXiv:2401.00579 [cs.CL]

Related models: https://huggingface.co/nlpie/Llama2-MedTuned-7b ; https://huggingface.co/nlpie/Llama2-MedTuned-13b

 

Neurology

PADS: Parkinson’s disease smartwatch dataset

The largest smartwatch-based dataset covering Parkinson’s disease, other movement disorders, and healthy controls (n>400). It includes over 5,000 clinical assessment steps from 504 participants, including PD, DD, and healthy controls (HC).

 

 

Related publication: Varghese, J., Brenner, A., Fujarski, M. et al. Machine Learning in the Parkinson’s disease smartwatch (PADS) dataset. npj Parkinsons Dis. 10, 9 (2024).

General

National Neighborhood Data Archive (NaNDA)

The National Neighborhood Data Archive (NaNDA) is a publicly available data archive containing contextual measures for locations across the United States. NaNDA offers theoretically derived, spatially referenced, nationwide measures of the physical and social environment. This dataset is useful for social determinants of health (SDOH) and public health research.

General

The MultiCaRe Dataset: Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central

A multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions.

 

Related publication: Mauro Andrés Nievas Offidani, Claudio Augusto Delrieux. Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023). Data in Brief, Volume 52, 2024, 110008, ISSN 2352-3409.

Github: https://github.com/mauro-nievoff/MultiCaRe_Dataset/tree/main

Imaging/

LLM/

Pulmonary

INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis

INSPECT contains data from 19,438 patients, including CT images, sections of radiology reports, and structured electronic health record (EHR) data (including demographics, diagnoses, procedures, and vitals). Using this dataset, the authors at Stanford University developed and released a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks. INSPECT is the largest multimodal dataset for enabling reproducible research on strategies for integrating 3D medical imaging and EHR data.

 

 

Related publication: INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis. Shih-Cheng Huang, Zepeng Huo, Ethan Steinberg, Chia-Chun Chiang, Matthew P. Lungren, Curtis P. Langlotz, Serena Yeung, Nigam H. Shah, Jason A. Fries. arXiv:2311.10798 [cs.LG]

General/

LLM

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

MedAlign is a clinician-generated dataset for instruction following with electronic medical records. The MedAlign dataset contains:

  • 1314 clinician-generated instructions, 983 after removing duplicates using ROUGE-L overlap;
  • 276 longitudinal EHRs;
  • 303 clinician-generated responses to instruction-EHR pairs.
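The duplicate removal mentioned in the first bullet can be illustrated with a minimal sketch of ROUGE-L-based filtering. This is not the MedAlign authors' code: the whitespace tokenization, the greedy keep-first strategy, and the 0.7 threshold are all assumptions chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings, using lowercase whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def deduplicate(instructions, threshold=0.7):
    """Keep an instruction only if it scores below the (assumed) threshold
    against every instruction that has already been kept."""
    kept = []
    for text in instructions:
        if all(rouge_l_f1(k, text) < threshold for k in kept):
            kept.append(text)
    return kept


# Made-up example instructions, not taken from MedAlign.
examples = [
    "Summarize the patient's cardiac history",
    "Summarize the patient's cardiac history briefly",
    "List all current medications",
]
unique = deduplicate(examples)
```

With these made-up inputs, the second instruction is dropped as a near-duplicate of the first, and the other two are kept.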

Related publication: MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Scott L. Fleming, Alejandro Lozano, Nigam H. Shah, et al. https://arxiv.org/abs/2308.14089

General/

LLM

EHRSHOT

EHRSHOT contains de-identified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. It is released alongside CLMBR-T-base, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients, and defines 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation.

 

 

Related publication: EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason A. Fries, Nigam H. Shah. arXiv:2307.02028

 

Imaging

Scottish Medical Imaging (SMI) Archive

The Scottish Medical Imaging Archive is a collection of population-based, routinely collected medical radiology images. This archive provides access to “analytics-ready” extracts for images between January 1, 2010, and August 31, 2018, which can be used for health care research and the development or validation of artificial intelligence algorithms. It is an archive of 57.3 million radiology studies linked to their medical records from the whole Scottish population. Modalities: Computerised Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Structured Reports (SRs).

 

Related publication: Nind, T., Sutherland, J., Krueger, S., Teviotdale, R., Gillen, K., Reel, P. S., Reel, S., Steele, D., Doney, A., Trucco, M., & Jefferson, E. (2023). The Scottish Medical Imaging Archive: 57.3 million Radiology Studies Linked to their Medical Records. Radiology: Artificial Intelligence. Advance online publication. https://doi.org/10.1148/ryai.220266

General/

Pulmonary

The COUGHVID crowdsourcing dataset

The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks.

 

 

Related publication: Lara Orlandic, Tomas Teijeiro, & David Atienza. (2021). The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms (2.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4498364

General/

LLM

MeQSum: Dataset for medical question summarization

MeQSum is a corpus of 1,000 summarized consumer health questions. The authors show that semantic augmentation from question datasets improves overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%.

 

 

Related publication: Asma Ben Abacha and Dina Demner-Fushman. 2019. On the Summarization of Consumer Health Questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2228–2234, Florence, Italy. Association for Computational Linguistics.

Cancer/

Pathology

NCT-CRC-HE-100K dataset of histological images of human colorectal cancer

This is a set of 100,000 non-overlapping image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and normal tissue. All images are 224×224 pixels (px) at 0.5 microns per pixel (MPP). Tissue classes are: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM). These images were manually extracted from N=86 H&E stained human cancer tissue slides from formalin-fixed paraffin-embedded (FFPE) samples from the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany). Tissue samples contained CRC primary tumor slides and tumor tissue from CRC liver metastases; normal tissue classes were augmented with non-tumorous regions from gastrectomy specimens to increase variability.
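As a small illustration of the figures above, the nine class abbreviations can be held in a lookup table and the physical size of a patch computed from the stated resolution. The alphabetical label ordering is an arbitrary assumption for this sketch, not something the dataset prescribes.

```python
# Tissue class abbreviations as listed in the dataset description.
TISSUE_CLASSES = {
    "ADI": "adipose",
    "BACK": "background",
    "DEB": "debris",
    "LYM": "lymphocytes",
    "MUC": "mucus",
    "MUS": "smooth muscle",
    "NORM": "normal colon mucosa",
    "STR": "cancer-associated stroma",
    "TUM": "colorectal adenocarcinoma epithelium",
}

PATCH_SIZE_PX = 224       # patch edge length in pixels
MICRONS_PER_PIXEL = 0.5   # stated resolution (MPP)


def patch_edge_microns(size_px=PATCH_SIZE_PX, mpp=MICRONS_PER_PIXEL):
    """Physical edge length of one patch in microns."""
    return size_px * mpp


# Integer label per class; alphabetical order is an assumption made here.
LABEL_INDEX = {abbr: i for i, abbr in enumerate(sorted(TISSUE_CLASSES))}
```

Each 224×224 patch therefore covers a tissue area of 112 µm × 112 µm.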

General

WorldPop Data

Free and open access to global development data: 44,745 population datasets covering births, pregnancies, child vaccinations and more. WorldPop is based at the University of Southampton and maps populations across the globe. Since 2004, it has partnered with governments, UN agencies and donors to produce almost 45,000 datasets, complementing traditional population sources with dynamic, high-resolution data for mapping human population distributions.

General

ELSI-Brazil (The Brazilian Longitudinal Study of Aging)

ELSI-Brazil (The Brazilian Longitudinal Study of Aging) aims to investigate the social and biological determinants of the aging process and its consequences to individuals and society. It is a nationally representative longitudinal study of community-dwelling adults aged 50 years or older, residing in 70 municipalities located across the five great geographic regions of Brazil. The baseline data collection was carried out in 2015-16 with 9,412 participants. The second wave was conducted in 2019-21 with 9,949 participants, including the sample replacement.

Endocrinology/

Imaging

TDID (Thyroid Digital Image Database)

TDID (Thyroid Digital Image Database) is a freely accessible database of ultrasound images of thyroid nodules from the National University of Colombia. Currently, this database has a group of B-mode ultrasound images, which include a complete annotation and diagnostic description of the suspicious thyroid lesions, made by expert radiologists. From March 2014 to date, information from 389 patients has been collected.

General

Global Health Data Exchange

The Global Health Data Exchange (GHDx) is a catalog of global health and demographic data. The goal of the GHDx is to help people locate data by cataloging information about data including the topics covered, by providing links to data providers or explaining how to acquire the data, and in cases where we have permission, providing the data directly for download. Use the GHDx to research population census data, surveys, registries, indicators and estimates, administrative health data, and financial data related to health.

General

DATASUS: Brazilian Ministry of Health dataset

DATASUS provides information that can support objective analyses of the health situation, evidence-based decision making and the development of health action programs. Measuring the population’s health status is a tradition in public health. It began with the systematic recording of mortality and survival data (Vital Statistics – Mortality and Live Births). With advances in the control of infectious diseases (Epidemiological and Morbidity information) and with a better understanding of the concept of health and its population determinants, the analysis of the health situation began to incorporate other dimensions of health status.

Data on morbidity, disability, access to services, quality of care, living conditions and environmental factors have become metrics used in the construction of Health Indicators, which translate into relevant information for the quantification and evaluation of health information.

This section also contains information on the population’s Health Care, registrations (Care Network),  hospital and outpatient networks, registration of health establishments, as well as information on financial resources and Demographic and Socioeconomic information.

General

All of Us Research Hub

The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit. Most All of Us participants contribute biosamples such as blood and/or saliva. DNA from these samples is extracted and sent to genome centers for genomic analysis, including whole genome sequencing (WGS) and genome-wide genotyping. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model.

740,000+ participants, 400,000+ electronic health records, 520,000+ biosamples.

General/

Genetics/

Imaging

UK Biobank

UK Biobank has collected and continues to collect extensive environmental, lifestyle, and genetic data on half a million participants. It includes data for:

  • Imaging: Brain, heart and full body MR imaging, plus full body DEXA scan of the bones and joints and an ultrasound of the carotid arteries.
  • Genetics: Whole genome sequencing for all 500,000 participants, whole exome sequencing for 470,000 participants, genotyping (800,000 genome-wide variants and imputation to 90 million variants).
  • Health linkages: Linkage to a wide range of electronic health-related records, including death, cancer, hospital admissions and primary care records.
  • Biomarkers: Data on more than 30 key biochemistry markers from all participants, taken from samples collected at recruitment and the first repeat assessment.
  • Activity monitor: Physical activity data over a 7-day period collected via a wrist-worn activity monitor for 100,000 participants plus a seasonal follow-up on a subset.
  • Online questionnaires: Data on a range of exposures and health outcomes that are difficult to assess via routine health records, including diet, food preferences, work history, pain, cognitive function, digestive health and mental health.
  • Repeat baseline assessments: A full baseline assessment is undertaken during the imaging assessment of 100,000 participants.
  • Samples: Blood and urine were collected from all participants, and saliva from 100,000.

Pulmonary

ICBHI 2017 Challenge: Respiratory Sound Database

The Respiratory Sound Database was originally compiled to support the scientific challenge organized at the Int. Conf. on Biomedical Health Informatics – ICBHI 2017. It contains audio samples collected independently by two research teams in two different countries over several years. The database consists of a total of 5.5 hours of recordings containing 6,898 respiratory cycles, of which 1,864 contain crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, in 920 annotated audio samples from 126 subjects.

 

 

Related publication: Garcia-Mendez JP, Lal A, Herasevich S, Tekin A, Pinevich Y, Lipatov K, Wang H-Y, Qamar S, Ayala IN, Khapov I, et al. Machine Learning for Automated Classification of Abnormal Lung Sounds Obtained from Public Databases: A Systematic Review. Bioengineering. 2023; 10(10):1155. 

Imaging

VQA-RAD: Visual Question Answering (VQA) for radiology images

A manually constructed dataset where clinicians asked naturally occurring questions about radiology images and provided reference answers. Manual categorization of images and questions provides insight into clinically relevant tasks and the natural language to phrase them. The dataset contains 104 head axial single-slice CTs or MRIs, 107 chest x-rays, and 104 abdominal axial CTs. The final VQA-RAD dataset contains 3,515 total visual questions. Of these, 1,515 (43.1%) are free-form.

 

Related publication: Lau, J., Gayen, S., Ben Abacha, A. et al. A dataset of clinically generated visual questions and answers about radiology images. Sci Data 5, 180251 (2018).

Imaging

SinoCT: Head CT dataset

This dataset contains over 9,000 head CT scans, each labeled as normal or abnormal. Each scan contains a reconstructed image (stored in the authors’ institutional PACS and saved as DICOMs) and a corresponding sinogram (simulated via GE’s CatSim software and saved as NumPy arrays). The reconstructed images are 512×512 pixels with a variable number of axial slices per scan. The sinograms are 984×888 pixels with a variable number of axial slices per scan. The full dataset is 1.3 TB.

 

Related publication: Hooper SM, Dunnmon JA, Lungren MP, Mastrodicasa D, Rubin DL, Ré C, Wang A, Patel BN. Impact of Upstream Medical Image Processing on Downstream Performance of a Head CT Triage Neural Network. Radiol Artif Intell. 2021 Apr 28;3(4):e200229. doi: 10.1148/ryai.2021200229. PMID: 34350412

LLM

MedInstruct-52k

A diverse medical task dataset comprising 52,000 instruction-response pairs, together with MedInstruct-test, a set of clinician-crafted novel medical tasks, to facilitate the building and evaluation of future domain-specific instruction-following models.

 

Related publication: AlpaCare: Instruction-tuned Large Language Models for Medical Application

General/

Surgery

MedShapeNet – A Large-scale Dataset of 3D Medical Shapes for Computer Vision

MedShapeNet contains over 100,000 medical shapes, including bones, organs, vessels, muscles, etc., as well as surgical instruments.

 

 

Related publication: MedShapeNet – A Large-scale Dataset of 3D Medical Shapes for Computer Vision

General

Med-HALT (Medical Domain Hallucination Test) dataset

This is the dataset used in the Med-HALT research paper, which focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the medical domain. The authors propose a new benchmark and dataset, Med-HALT, designed specifically to evaluate hallucinations. It provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests, reasoning-based and memory-based hallucination tests, designed to assess LLMs’ problem-solving and information retrieval abilities.

Imaging

EMBED: Mammographic dataset

EMBED contains 364,000 screening and diagnostic mammographic exams for 110,000 patients from four hospitals over an 8-year period. The EMBED AWS Open Data release represents 20% of the dataset divided into two equal cohorts at the patient level. This release of the dataset includes 2D and C-view images.

Cardiology

MIMIC-IV-ECHO: Echocardiogram Matched Subset

The MIMIC-IV-ECHO module contains more than 500,000 echocardiograms across 7,243 studies from 4,579 distinct patients.  This subset contains echocardiograms for patients who appear in the MIMIC-IV Clinical Database and were admitted between 2017 and 2019.

 

 

Related publication: Gow, B., Pollard, T., Greenbaum, N., Moody, B., Johnson, A., Herbst, E., Waks, J. W., Eslami, P., Chaudhari, A., Carbonati, T., Berkowitz, S., Mark, R., & Horng, S. (2023). MIMIC-IV-ECHO: Echocardiogram Matched Subset (version 0.1). PhysioNet. https://doi.org/10.13026/ef48-v217.

Genetics

The Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD), originally launched in 2014 as the Exome Aggregation Consortium (ExAC), is the result of a coalition of investigators willing to share aggregate exome and genome sequencing data from a variety of large-scale sequencing projects, and make summary data available for the wider scientific community.

  • v4 release is composed of 730,947 exomes and 76,215 genomes (GRCh38)
  • gnomAD v4 structural variants (SV) represent 63,046 genomes (GRCh38)
  • gnomAD v4 copy number variants (CNV) represent variants in less than 1% of 464,297 exomes (GRCh38)

General

CodiEsp corpus: gold standard Spanish clinical cases coded in ICD10 (CIE10)

The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish language and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS). The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The train set contains 500 clinical cases, and the development and test set 250 clinical cases each.

Related publication: Iker de la Iglesia, María Vivó, Paula Chocrón, Gabriel de Maeztu, Koldo Gojenola, Aitziber Atutxa. An open source corpus and automatic tool for section identification in Spanish health records. Journal of Biomedical Informatics, 2023, 104461, ISSN 1532-0464. https://doi.org/10.1016/j.jbi.2023.104461

Gastroenterology

The SEE-AI Project Dataset (Small Bowel Endoscopy Images)

This dataset comprises 18,481 images extracted from 523 small bowel capsule endoscopy videos: 12,320 annotated images containing 23,033 disease lesions, combined with 6,161 normal mucosa images. The annotations are provided in YOLO format.
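For readers unfamiliar with YOLO-format annotations, the sketch below converts label lines into pixel bounding boxes. It assumes the common YOLO text convention of one object per line, written as `class_id x_center y_center width height` with coordinates normalized to the image size. The sample label text and the 400×200 image size are made up for illustration and are not taken from the SEE-AI files.

```python
def parse_yolo_line(line, image_width, image_height):
    """Convert one YOLO label line into
    (class_id, x_min, y_min, x_max, y_max) in pixel coordinates."""
    class_id, xc, yc, w, h = line.split()
    xc, w = float(xc) * image_width, float(w) * image_width
    yc, h = float(yc) * image_height, float(h) * image_height
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# Hypothetical label file content: two lesions in one image.
label_text = "3 0.5 0.5 0.25 0.5\n0 0.25 0.75 0.5 0.5\n"

boxes = [
    parse_yolo_line(line, image_width=400, image_height=200)
    for line in label_text.splitlines()
    if line.strip()
]
```

An image with no lesions would typically have an empty label file, which this loop handles by producing an empty list.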

 

 

Related publication: Yokote A, Umeno J, Kawasaki K, Fujioka S, Fuyuno Y, Matsuno Y, Yoshida Y, Imazu N, Miyazono S, Moriyama T, Kitazono T, Torisu T. Small bowel capsule endoscopy examination and open access database with artificial intelligence: The SEE-artificial intelligence project. DEN Open. 2023 Jun 22;4(1):e258. doi: 10.1002/deo2.258. PMID: 37359150; PMCID: PMC10288072.

General

AWS (Amazon) Marketplace Datasets

More than 80 open source healthcare datasets available through the AWS Open Data Sponsorship Program.

General

NHS-LLM and OpenGPT datasets

3 datasets:

Neurology

AMP®-Parkinson’s Disease Progression Prediction

Data to predict the course of Parkinson’s disease (PD) using protein abundance data. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity. This is a time-series code dataset with Kaggle’s time-series API.

Neurology

Parkinson’s Freezing of Gait Prediction datasets

The data series include three datasets, collected under distinct circumstances:

  • The tDCS FOG (tdcsfog) dataset, comprising data series collected in the lab, as subjects completed a FOG-provoking protocol.
  • The DeFOG (defog) dataset, comprising data series collected in the subject’s home, as subjects completed a FOG-provoking protocol.
  • The Daily Living (daily) dataset, comprising one week of continuous 24/7 recordings from sixty-five subjects. Forty-five subjects exhibit FOG symptoms and also have series in the defog dataset, while the other twenty subjects do not exhibit FOG symptoms and do not have series elsewhere in the data.

Cardiology

AHA Precision Medicine Platform

The Precision Medicine Platform is the only research interface with access to The American Heart Association’s Get With The Guidelines registry data:

  • 2,600+ hospitals (50% of all US hospitals)
  • 20+ years of data collection
  • 13,000,000+ national patient records
  • 90% of stroke discharges
  • 22% of cardiovascular discharges

General/

Genetics

All of Us Research database

The National Institutes of Health’s All of Us Research Program is building one of the largest biomedical data resources of its kind.

600,000+ participants

350,000+ EHR records

450,000+ biomedical specimen data

 

Cancer/

Imaging

NYUMets datasets

3 metastatic cancer datasets available through AWS API.

  • Time Series Dataset – Each row in the time series dataset represents a point in time, in units of days indexed from each patient’s initial gamma knife radiosurgery. Dataset variables include clinical details related to medication changes, imaging timing/references to raw imaging files, procedure timing, clinical follow up, and outcomes.
  • Individual Dataset – Each row represents an individual patient with demographic details and summary clinical data.
  • Gamma Knife Details Dataset – Each row represents an individual gamma knife target to provide further details about available gamma knife radiosurgery.

Dermatology

Dermofit Image Library

The Dermofit Image Library is a collection of 1,300 focal high quality skin lesion images collected under standardised conditions with internal colour standards. The lesions span ten different classes including melanomas, seborrhoeic keratosis and basal cell carcinomas. Each image has a gold standard diagnosis based on expert opinion (including dermatologists and dermatopathologists). Images consist of a snapshot of the lesion surrounded by some normal skin. The Dermofit Image Library is available under an academic licence. There is a one-off £75 licence fee associated with this product.

 

 

Related publication: Rees, Aldridge, Fisher, Ballerini (2013), A Color and Texture Based Hierarchical K-NN Approach to the Classification of Non-melanoma Skin Lesions, Color Medical Image Analysis, Lecture Notes in Computational Vision and Biomechanics 6 (M. E. Celebi, G. Schaefer (eds.))

 

 

Imaging

VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations

A dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, 18,000 images were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases.

Cardiology/

Pediatrics

EchoNet-Pediatric

The EchoNet-Peds database includes 7,643 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes. The database includes patients ranging from 0-18 years (43% female) with a wide range of sizes.

 

 

Related publication: Reddy CD, Lopez L, Ouyang D, Zou JY, He B. Video-Based Deep Learning for Automated Assessment of Left Ventricular Ejection Fraction in Pediatric Patients. J Am Soc Echocardiogr. 2023 Feb 6:S0894-7317(23)00068-8. doi: 10.1016/j.echo.2023.01.015. Epub ahead of print. PMID: 36754100.

Imaging

BraTS (Brain Tumor Segmentation) data

All BraTS multimodal scans are available as NIfTI files (.nii.gz), which were acquired with different clinical protocols and various scanners from multiple (n=19) institutions. The overall survival (OS) data, defined in days, are included in a comma-separated value (.csv) file with correspondences to the pseudo-identifiers of the imaging data. The .csv file also includes the age of patients, as well as the resection status.
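As a minimal sketch of how the survival file can be linked back to the imaging data through the pseudo-identifiers, the snippet below parses a small stand-in CSV. The header names (`subject_id`, `age`, `survival_days`, `resection_status`), the rows, and the file-naming pattern in `scan_path_for` are all invented for illustration; the real files should be inspected before relying on any of them.

```python
import csv
import io

# Hypothetical stand-in for the survival .csv; headers and rows are invented.
SURVIVAL_CSV = """subject_id,age,survival_days,resection_status
case_001,60.5,289,GTR
case_002,52.3,616,STR
case_003,71.0,120,NA
"""


def load_survival(text):
    """Index survival rows by pseudo-identifier."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[row["subject_id"]] = {
            "age": float(row["age"]),
            "survival_days": int(row["survival_days"]),
            "resection_status": row["resection_status"],
        }
    return table


def scan_path_for(subject_id, modality="t1"):
    """Build a scan path from a pseudo-identifier (assumed naming pattern)."""
    return f"{subject_id}/{subject_id}_{modality}.nii.gz"


survival = load_survival(SURVIVAL_CSV)
```

Reading the NIfTI volumes themselves would need a dedicated imaging library, which is deliberately left out of this sketch.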

 

Related publication: B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, et al. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”, IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694

Pathology

CAMELYON data sets: WSI images

The data in this challenge contains whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections. Depending on the particular data set (see below), ground truth is provided:

  • On a lesion-level: with detailed annotations of metastases in WSI.
  • On a patient-level: with a pN-stage label per patient.

All ground truth annotations were carefully prepared under supervision of expert pathologists. WSI are provided as TIFF images. Lesion-level annotations are provided as XML files. For training, 100 patients are provided, with another 100 patients for testing. With 5 slides per patient, the test data set contains 500 slides and the full data set contains 1,000 slides.
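Lesion-level XML annotations of this kind can be read with the Python standard library. The sketch below is an assumption-laden illustration: the element and attribute names (`Annotation`, `Coordinate`, `Name`, `Order`, `X`, `Y`) and the sample content are placeholders modeled on a typical polygon annotation layout, and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real annotation files may use a different layout.
SAMPLE_XML = """<Annotations>
  <Annotation Name="lesion_0" Type="Polygon">
    <Coordinates>
      <Coordinate Order="1" X="150.0" Y="200.0" />
      <Coordinate Order="0" X="100.0" Y="200.0" />
      <Coordinate Order="2" X="150.0" Y="260.0" />
    </Coordinates>
  </Annotation>
</Annotations>"""


def read_polygons(xml_text):
    """Return {annotation name: [(x, y), ...]} with vertices in drawing order."""
    root = ET.fromstring(xml_text)
    polygons = {}
    for ann in root.iter("Annotation"):
        coords = sorted(ann.iter("Coordinate"), key=lambda c: int(c.get("Order")))
        polygons[ann.get("Name")] = [
            (float(c.get("X")), float(c.get("Y"))) for c in coords
        ]
    return polygons


polygons = read_polygons(SAMPLE_XML)
```

The resulting vertex lists are in slide pixel coordinates and can be rasterized into a mask at whichever resolution level of the TIFF is being used.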

 

Imaging

Chest X-rays (Indiana University)

The dataset contains 7,471 chest X-ray images in .png format and 3,955 patient radiology text reports in .XML format. Each image is paired with four captions (Impressions, Findings, Comparison, and Indication) that provide clear descriptions of the salient entities and events.

Original data source: https://openi.nlm.nih.gov/

 

General/

Imaging

Open-i: National Library of Medicine

Open-i provides access to over 3.7 million images from about 1.2 million PubMed Central® articles; 7,470 chest x-rays with 3,955 radiology reports; 67,517 images from NLM History of Medicine collection; and 2,064 orthopedic illustrations.

Imaging

Brain tissue segmentation MRI dataset

A synthetic dataset of brain images simulated across 42 different MR protocols and based on 500 different reference brains from the Human Connectome Project (HCP) (Van Essen et al., 2012), leading to 21,000 simulated brain images.

Related Publication: You S, Reyes M. Influence of contrast and texture based image modifications on the performance and attention shift of U-Net models for brain tissue segmentation. Frontiers in Neuroimaging. 2022;1.

Imaging

The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset

An open-source data collection consisting of a total of 955 T1-weighted MRIs (Magnetic Resonance Imaging) with manually segmented diverse lesions and metadata.

Related publication: Liew, Sook-Lei. The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset – Release 2.0, 2021. Inter-university Consortium for Political and Social Research [distributor], 2022-08-08. https://doi.org/10.3886/ICPSR36684.v5

Cancer/

Imaging

Breast Cancer MRI Dataset: Duke

The dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, having the following data components:

  1. Demographic, clinical, pathology, treatment, outcomes, and genomic data: Collected from a variety of sources including clinical notes, radiology reports, and pathology reports.
  2. Pre-operative dynamic contrast enhanced (DCE)-MRI: Downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release in DICOM format.
  3. Locations of lesions in DCE-MRI: Annotations on the DCE-MRI images by radiologists.
  4. Imaging features from DCE-MRI: A set of 529 imaging features extracted by in-house software.

Related publication: Saha, A., Harowicz, M.R., Grimm, L.J., Kim, C.E., Ghate, S.V., Walsh, R. and Mazurowski, M.A., 2018. A machine learning approach to radiogenomics of breast cancer.

General

National Health and Nutrition Examination Survey (NHANES) Data

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The survey examines a nationally representative sample of about 5,000 persons each year. Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases.

General

Protective Policy Index (PPI) global dataset for COVID-19

This is an original dataset of stringency of public health policy measures that were adopted in response to COVID-19 worldwide by governments at national and sub-national levels. The data set covers governments’ policy responses between January 24, 2020 and December 31, 2020.

Related publication: Shvetsova, O., Zhirnov, A., Adeel, A.B. et al. Protective Policy Index (PPI) global dataset of origins and stringency of COVID 19 mitigation policies. Sci Data 9, 319 (2022). https://doi.org/10.1038/s41597-022-01437-9

Cardiology/

General/

Pathology

Nightingale Open Science Datasets

Multiple datasets available:

  1. silent-cchs-ecg: Diagnosing ‘silent’ heart attack (48,000 ECG waveforms)
  2. brca-psj-path: Identifying high-risk breast cancer (175,000 biopsy slides)
  3. arrest-ntuh-ecg: Subtyping cardiac arrest (24,106 ECG waveforms)
  4. fracture-aimi-xray: Predicting fractures (64,000 chest x-rays)
  5. covid-psj-xray: Emergency triage of Covid-19 patients (7,500 chest x-rays)

General/

Pulmonary

COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening

A dataset consisting of 53,449 audio samples (over 552 hours in total) crowd-sourced from 36,116 participants through the COVID-19 Sounds app. It also provides participants’ self-reported COVID-19 testing status, with 2,106 samples from participants who tested positive.

 

Related publication: COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening

Imaging

RadGraph: Extracting Clinical Entities and Relations from Radiology Reports

This dataset contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Additionally, there is an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs.

 

Related publication: Jain, S., Agrawal, A., Saporta, A., Truong, S. Q., Nguyen Duong, D., Bui, T., Chambon, P., Lungren, M., Ng, A., Langlotz, C., & Rajpurkar, P. (2021). RadGraph: Extracting Clinical Entities and Relations from Radiology Reports (version 1.0.0). PhysioNet. https://doi.org/10.13026/hm87-5p47.

 

General

Papers with code medical datasets

200+ datasets of various types with links and papers. Includes search options for data types, language and more.

Dermatology

PH² – a dermoscopic image database

The PH² database includes the manual segmentation, the clinical diagnosis, and the identification of several dermoscopic structures, performed by expert dermatologists, in a set of 200 dermoscopic images.

 

Related publication: Mendonca T, Ferreira PM, Marques JS, Marcal AR, Rozeira J. PH² – a dermoscopic image database for research and benchmarking. Annu Int Conf IEEE Eng Med Biol Soc. 2013;2013:5437-40. doi: 10.1109/EMBC.2013.6610779. PMID: 24110966

General

VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection

The Vision-based Fallen Person (VFP290K) dataset consists of 294,713 frames of fallen persons extracted from 178 videos, including 131 scenes in 49 locations. The authors demonstrated the effectiveness of its features through extensive experiments analyzing the performance shift across object detection models.

 

Related publication: VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection

Critical Care

HiRID, a high time-resolution ICU dataset

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand adult patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland (ICU), an interdisciplinary 60-bed unit admitting >6,500 patients per year. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters from almost 34 thousand admissions during the period from January 2008 to June 2016. Data is stored with a uniquely high time resolution of one entry every two minutes.
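To make the stated time resolution concrete, the sketch below works out how many entries one variable would have per day at one entry every two minutes, and maps a timestamp onto that two-minute grid. The function names and the example timestamps are hypothetical; only the two-minute interval comes from the description above.

```python
from datetime import datetime, timedelta

# One entry every two minutes, as stated in the dataset description.
RESOLUTION = timedelta(minutes=2)


def entries_per_day(resolution=RESOLUTION):
    """Number of grid slots in 24 hours at the given resolution."""
    return int(timedelta(days=1) / resolution)


def grid_index(admission, timestamp, resolution=RESOLUTION):
    """Zero-based slot index of a timestamp, counted from admission time."""
    return (timestamp - admission) // resolution


# Hypothetical admission and measurement times, for illustration only.
admission = datetime(2010, 1, 1, 8, 0)
measurement = datetime(2010, 1, 1, 8, 5)
slot = grid_index(admission, measurement)
```

At this resolution a single variable yields 720 slots per day, which is worth keeping in mind when estimating memory needs for multi-day stays.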

 

Related publication: Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., & Merz, T. (2021). HiRID, a high time-resolution ICU dataset (version 1.1.1). PhysioNet. https://doi.org/10.13026/nkwc-js72.

Critical Care

The eICU Collaborative Research Database

The eICU Collaborative Research Database is a multi-center intensive care unit (ICU) database with high-granularity data for over 200,000 admissions to ICUs monitored by eICU Programs across the United States.

 

Related publication: The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI: http://dx.doi.org/10.1038/sdata.2018.178.

Critical Care

MIMIC-IV

The Medical Information Mart for Intensive Care (MIMIC)-IV database provides critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC).

 

Related publication: Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2020). MIMIC-IV (version 0.4). PhysioNet. https://doi.org/10.13026/a3wn-hq05.

General/

Neurology/

Ophthalmology

EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction

A dataset of paired Electroencephalography (EEG) and video-infrared eye tracking (ET) recordings from 356 subjects for more than 47 hours in total. A benchmark consisting of 3 evaluation tasks with increasing difficulty is introduced alongside the dataset.

 

Related publication: EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction

Anesthesiology/

General/

Neurology

Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management

Q-Pain is a dataset for assessing bias in medical QA in the context of pain management. It contains 55 medical question-answer pairs across five different types of pain management. Each question includes a detailed patient-specific medical scenario (“vignette”) designed to enable the substitution of multiple different racial and gender “profiles” and to evaluate whether bias is present when answering whether or not to prescribe medication.
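The profile substitution idea can be sketched in a few lines. Everything here is a hypothetical stand-in: the template text, the profile lists, and the `expand_vignette` helper are invented for illustration and are not taken from Q-Pain itself.

```python
from itertools import product

# Invented placeholder vignette; not an actual Q-Pain question.
TEMPLATE = (
    "Patient is a {age}-year-old {race} {gender} reporting severe pain "
    "after surgery. Should pain medication be prescribed?"
)
# Hypothetical profile values chosen only to show the mechanics.
RACES = ["Black", "White", "Asian", "Hispanic"]
GENDERS = ["man", "woman"]


def expand_vignette(template, races, genders, age=45):
    """Produce one prompt per race and gender profile from a single vignette."""
    prompts = []
    for race, gender in product(races, genders):
        prompts.append(
            {
                "race": race,
                "gender": gender,
                "prompt": template.format(age=age, race=race, gender=gender),
            }
        )
    return prompts


variants = expand_vignette(TEMPLATE, RACES, GENDERS)
```

Because the clinical details are identical across variants, any systematic difference in a model's answers between profiles can be attributed to the substituted attributes.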

 

Related publication: Logé, C., Ross, E., Dadey, D. Y. A., Jain, S., Saporta, A., Ng, A., & Rajpurkar, P. (2021). Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management (version 1.0.0). PhysioNet. https://doi.org/10.13026/2tdv-hj07.

Imaging

Chest ImaGenome Dataset

The dataset provides 1) 1,256 combinations of relation annotations between 29 CXR anatomical locations (objects with bounding box coordinates) and their attributes, structured as a scene graph per image, 2) over 670,000 localized comparison relations (for improved, worsened, or no change) between the anatomical locations across sequential exams, and 3) a manually annotated gold standard scene graph dataset from 500 unique patients.

 

Related publication: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/wv01-y230.

General

Therapeutics Data Commons (TDC)

TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and diverse types of data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards.

 

Related publication: Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Imaging

Report-Annotated Duke Chest CT (RAD-ChestCT)

The RAD-ChestCT dataset is an imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset.

 

 

Related publication: Draelos et al., “Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes,” Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857

Dermatology

MED-NODE

The dataset consists of 170 images (70 melanoma and 100 naevus cases) from the digital image archive of the Department of Dermatology of the University Medical Center Groningen (UMCG), used for the development and testing of the MED-NODE system for skin cancer detection from macroscopic images.

 

Related publications: I. Giotis, N. Molders, S. Land, M. Biehl, M.F. Jonkman and N. Petkov: “MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images”, Expert Systems with Applications, 42 (2015), 6578-6585

General

BigBIO: Biomedical NLP datasets

BIGBIO is a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BIGBIO enables reproducible data-centric machine learning workflows by focusing on programmatic access to datasets and their metadata in a uniform format.

 

Related Publication: BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Dermatology

PAD-UFES-20: a skin lesion dataset collected from smartphones

The dataset consists of 2,298 samples of six different types of skin lesions. Each sample consists of a clinical image and up to 22 clinical features including the patient’s age, skin lesion location, Fitzpatrick skin type, and skin lesion diameter. In total, there are 1,373 patients, 1,641 skin lesions, and 2,298 images present in the dataset. All BCC, SCC, and MEL are biopsy-proven. The remaining ones may have a clinical diagnosis according to a consensus of a group of dermatologists. In total, approximately 58% of the samples in this dataset are biopsy-proven.

 

Related publication: PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones

General/

Microbiology

International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC) COVID-19 dataset

The database includes data from more than 705,000 patients, collected in more than 60 countries and 1,500 centres worldwide. Patient data are available from acute hospital admissions with COVID-19 and outpatient follow-ups. The data include signs and symptoms, pre-existing comorbidities, vital signs, chronic and acute treatments, complications, dates of hospitalization and discharge, mortality, viral strains, vaccination status, and other data.

 

 

Related publication: ISARIC-COVID-19 dataset: A Prospective, Standardized, Global Dataset of Patients Hospitalized with COVID-19

Dermatology

SNU dataset

2,201 images with diagnoses based on biopsy or clinical impression. 174 disease classes for model training.

 

 

General

BioRED: a rich biomedical relation extraction dataset

Biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of 600 PubMed abstracts.

 

 

Related publication: Ling Luo, et al. BioRED: a rich biomedical relation extraction dataset, Briefings in Bioinformatics, 2022

Imaging

RadImageNet

The RadImageNet database includes 1.35 million annotated CT, MRI, and ultrasound images of musculoskeletal, neurologic, oncologic, gastrointestinal, endocrine, and pulmonary pathology. The RadImageNet database contains medical images of 3 modalities, 11 anatomies, and 165 pathologic labels.

Imaging

BRAX, a Brazilian labeled chest X-ray dataset

BRAX dataset provides 40,967 images, 24,959 imaging studies for 19,351 patients presenting to the Hospital Israelita Albert Einstein. All images have been verified by trained radiologists and de-identified to protect patient privacy. Fourteen labels were derived from free-text radiology reports written in Brazilian Portuguese using Natural Language Processing.

 

 

Related publication: BRAX, a Brazilian labeled chest X-ray dataset

Imaging

MONAI: Medical Open Network for Artificial Intelligence

The MONAI framework is the open-source foundation being created by Project MONAI. MONAI is a freely available, community-supported, PyTorch-based framework for deep learning in healthcare imaging. Project MONAI also includes MONAI Label, an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets, and build AI models in a standardized MONAI paradigm.

Imaging

UPENN-GBM: MRI scans for Glioblastoma (GBM) patients

This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format.

630 patients, 3,301 studies, 820,000+ images.

General/

Imaging/

Pathology/

Surgery

Grand Challenge: Image analysis datasets and algorithms

A platform for end-to-end development of machine learning solutions in biomedical imaging. Grand Challenge was developed in 2010 to make it easy for challenge organizers to set up a website for a particular challenge and to make all information on challenges in the domain of biomedical image analysis available in one place. This system has been operational since 2017 and has been used by over 300 challenges and 70,000 users, with more than 1,000 algorithms.

Dermatology

Seven-Point Checklist Dermatology Dataset

A database for evaluating computerized image-based prediction of the 7-point skin lesion malignancy checklist. The dataset includes over 2000 clinical and dermoscopy color images, along with corresponding structured metadata tailored for training and evaluating computer aided diagnosis (CAD) systems.

 

Related publication: J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “Seven-Point Checklist and Skin Lesion Classification using Multitask Multimodal Neural Nets,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 2, pp. 538–546, 2019.

Imaging/

Neurology

OpenNeuroDatasets

A free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data. 720 public datasets and growing.

 

 

Webpage: https://openneuro.org/

Cardiology

EchoNet-LVH

The EchoNet-LVH dataset includes 12,000 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac chamber size and wall thickness.

 

 

Related publication: High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy with Cardiovascular Deep Learning

Imaging

Japanese Society of Radiological Technology (JSRT) database

The database includes 154 conventional chest radiographs with a lung nodule (100 malignant and 54 benign nodules) and 93 radiographs without a nodule. The database also includes additional information such as patient age, gender, diagnosis (malignant or benign), X and Y coordinates of the nodule, and a simple diagram of the nodule location. Lung nodule images were classified into five groups according to their degree of subtlety.

 

Related publication: Shiraishi J, Katsuragawa S, Ikezoe J, et al: Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. AJR 174:71-74, 2000.

Anesthesiology

Behavioral and autonomic dynamics during propofol-induced unconsciousness dataset

Data was collected from nine healthy volunteers during a study of propofol-induced unconsciousness. For all subjects, approximately 3 hours of data were recorded while using a target-controlled infusion protocol. Data includes continuous electrocardiogram (ECG); interventions included in the study for patient safety, such as administering phenylephrine (a vasopressor); heart rate variability (HRV) and electrodermal activity (EDA).

 

Related publication: Subramanian, S., Purdon, P., Barbieri, R., & Brown, E. (2021). Behavioral and autonomic dynamics during propofol-induced unconsciousness (version 1.0). PhysioNet. https://doi.org/10.13026/2rbc-1r03.

Ophthalmology

A global review of publicly available datasets for ophthalmological imaging

94 open access ophthalmological imaging datasets containing 507,724 images and 125 videos from 122,364 patients.

Cardiology

PTB-XL: EKG dataset

The PTB-XL ECG dataset is a large dataset of 21,837 clinical 12-lead ECGs of 10-second length from 18,885 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. In total, 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements.

 

Related publication: Wagner, P., Strodthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze, F.I., Samek, W., Schaeffter, T. (2020), PTB-XL: A Large Publicly Available ECG Dataset. Scientific Data. https://doi.org/10.1038/s41597-020-0495-6

Cardiology/

Dermatology/

General/

Imaging

Stanford AIMI Shared Datasets

A collection of de-identified annotated medical imaging data to foster transparent and reproducible collaborative research. Includes X-rays, CT scans, MRIs, echocardiography, and dermatology images.

Dermatology

DDI – Diverse Dermatology Images: Stanford AIMI Dataset

The Diverse Dermatology Images (DDI) dataset is the first publicly available, deeply curated, and pathologically confirmed image dataset with diverse skin tones. The DDI was retrospectively selected by reviewing pathology reports in Stanford Clinics from 2010 to 2020. It has a total of 656 images representing 570 unique patients.

General

Hugging Face datasets

Datasets is a library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Currently over 2,658 datasets and more than 34 metrics are available, with at least 13 datasets returned by a search for the term “medical”. A dataset can be loaded in a single line of code, and the library's data processing methods help get it ready for training a deep learning model.

Pulmonary

DCSM Sleep Staging Dataset

The DCSM dataset consists of 255 randomly selected and fully anonymized overnight lab-based PSG recordings from patients visiting the DCSM for the diagnosis of non-specific sleep related disorders. The DCSM dataset represents a diverse cohort of Danish patients with respect to demographic characteristics, diagnostic background and sleep/non-sleep related medication usage, collected between 2015 and 2018.

 

Pulmonary

Dreem Open Datasets

Two publicly available datasets: DOD-H, including 25 healthy volunteers, and DOD-O, including 55 patients suffering from obstructive sleep apnea (OSA). Both datasets have been scored by 5 sleep technologists from different sleep centers. The authors developed a framework to compare automated approaches to a consensus of multiple human scorers.

 

Related publication: A. Guillot, F. Sauvet, E. H. During and V. Thorey, “Dreem Open Datasets: Multi-Scored Sleep Datasets to Compare Human and Automated Sleep Staging,” in IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 9, pp. 1955-1965, Sept. 2020, doi: 10.1109/TNSRE.2020.3011181.

Cardiology/

Imaging

Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (M&Ms) Dataset

375 heterogeneous cardiac magnetic resonance (CMR) datasets acquired by using four different scanner vendors in six hospitals and three different countries (Spain, Canada and Germany).

 

Related publication: V. M. Campello et al., “Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge,” in IEEE Transactions on Medical Imaging, vol. 40, no. 12, pp. 3543-3554, Dec. 2021, doi: 10.1109/TMI.2021.3090082.

Cancer/

Genetics/

Imaging

The Cancer Imaging Archive (TCIA) dataset collection

TCIA data are organized as “collections”; typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc) or research focus. Supporting data related to the images such as patient outcomes, treatment details, genomics and image analyses are also provided when available. 100+ datasets, many of which are public.

General

n2c2 NLP Research Data Sets

Unstructured notes from the Research Patient Data Registry at Partners Healthcare, Boston, USA (originally developed during the i2b2 project). Clinical Natural Language Processing (NLP) data sets were originally created at a former NIH-funded National Center for Biomedical Computing (NCBC) known as i2b2: Informatics for Integrating Biology and the Bedside. Beginning in 2018, they are officially known as n2c2 (National NLP Clinical Challenges).

General

emrQA dataset

emrQA is a publicly available, large-scale EMR Question Answering (QA) corpus created using a novel semi-automated generation framework that allows for minimal expert involvement and re-purposes existing annotations available for other clinical NLP tasks. emrQA has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. The dataset uses existing NLP task annotations from the i2b2 Challenge datasets.

 

 

Related publication: Pampari, A., Raghavan, P., Liang, J.J., & Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP.

Anesthesiology

VSCapture: An open source tool for Data acquisition from anesthesia monitor

VSCapture is an open source tool developed in the C# programming language on the .NET/Mono platform, which allows the tool to run on Windows, Macintosh OS X, and Linux Ubuntu operating systems.

 

Related Publication: Data acquisition from S/5 GE Datex anesthesia monitor using VSCapture.

 

Related Dataset: The University of Queensland Vital Signs Dataset.

 

The University of Queensland Vital Signs Dataset contains a wide range of patient monitoring data and vital signs that were recorded during 32 surgical cases where patients underwent anaesthesia at the Royal Adelaide Hospital.

Cancer/

Pathology

Prostate cANcer graDe Assessment (PANDA) Challenge dataset

12,625 whole-slide images (WSIs) of prostate biopsies were available for model development (the development set), 393 for performance evaluation during the competition phase (the tuning set), 545 as the internal validation set in the postcompetition phase and 1,071 for external validation from 6 different sites.

 

Related publication: Bulten, W., Kartasalo, K., Chen, PH.C. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med (2022). https://doi.org/10.1038/s41591-021-01620-2

Cardiology/

General

Hero DMC Heart Institute (HDHI): Hospital admissions dataset

This dataset comes from the cardiology unit of a tertiary care medical college and hospital in India, which had 14,845 admissions corresponding to 12,238 patients.

 

Related publication: Bollepalli, S.C.; Sahani, A.K.; Armoundas, A.A., et al. An Optimized Machine Learning Model Accurately Predicts In-Hospital Outcomes at Admission to a Cardiac Unit. Diagnostics 2022, 12, 241.

https://doi.org/10.3390/diagnostics12020241

 

Dermatology

International Skin Imaging Collaboration (ISIC) Dataset

The dataset includes over 69,000 dermatology images. The International Skin Imaging Collaboration (ISIC) is a global partnership that has organized the world’s largest repository of publicly available dermoscopic images and hosted the first public benchmarks for melanoma detection in dermoscopic images, titled “Skin Lesion Analysis Towards Melanoma Detection”, at the IEEE International Symposium of Biomedical Imaging (ISBI).

Imaging

CQ500 dataset

A dataset of 491 head CT scans with 193,317 slices, including anonymized DICOMs for all the scans and the corresponding reads done by three radiologists with 8, 12 and 20 years of experience in cranial CT interpretation, respectively.

 

Related publication: Development and Validation of Deep Learning Algorithms for Detection of Critical Findings in Head CT scan.

Critical Care/

Imaging

COVID-Net

A publicly available suite of tailored deep neural network models for tackling different challenges, ranging from screening to risk stratification to treatment planning, for patients with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

  • Chest x-rays: 16,352 CXR images across 14,979 patients
  • Chest CT: 201,103 CT slices from 4,501 patients
  • Chest point-of-care ultrasound: 29,651 POCUS images
  • COVID-Net ICU: 1,925 records from 385 patients

The project has also expanded to the open source TB-Net initiative for tuberculosis screening, the Fibrosis-Net initiative for pulmonary fibrosis progression prediction, and the Cancer-Net initiative for cancer screening.

Emergency Department

MIMIC-IV-ED

MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2016. It covers 448,972 ED stays, with vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses available.

Imaging

RICORD: RSNA International COVID-19 Open Annotated Radiology Database

This database is the first multi-institutional, multi-national expert annotated COVID-19 imaging dataset. Annotated by three radiologists with majority vote adjudication by board certified radiologists, RICORD consists of 240 thoracic CT scans and 1,000 chest radiographs contributed from four international sites.

Anesthesiology

VitalDB dataset

A comprehensive dataset of 6,388 surgical patients composed of intraoperative biosignals and clinical information from the Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul, Korea.

Pathology

NuCLS

The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from The Cancer Genome Atlas (TCGA). These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students.

Imaging

CheXpert

CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients from Stanford Hospital.

Cancer/

Genetics

Genomic Data Commons (GDC) datasets

The GDC Portal is a platform from the National Cancer Institute (NCI) with cancer-related genomic data for 80,000+ cases.

Imaging

BIMCV-COVID19 Imaging Datasets

The BIMCV-COVID19+ dataset is a large dataset with chest X-ray images and computed tomography (CT) imaging of COVID-19 patients, along with their radiographic findings, pathologies, polymerase chain reaction (PCR) and immunoglobulin antibody test results, and radiographic reports from the Valencian Region Medical Image Bank (BIMCV). These iterations of the database include 7,377 CR, 9,463 DX, and 6,687 CT studies.

Imaging

VinBigData Chest X-ray abnormalities detection

Provided on Kaggle by the Vingroup Big Data Institute (VinBigData), which aims to promote fundamental research and investigate novel and highly applicable technologies. The dataset consists of 18,000 images annotated by experienced radiologists.

Cardiology

EchoNet-Dynamic

The EchoNet-Dynamic database includes 10,030 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes.

 

Related publication: Video-based AI for beat-to-beat assessment of cardiac function

Genetics/

Pharmacology

PGxCorpus: a Manually Annotated Corpus for Pharmacogenomics

941 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs, and phenotypes) and the relationships between them.

General

CENTAUR LABS

A list of open-source healthcare datasets classified into 40+ specialties, with direct links to the datasets and more information.

General

DATA WORLD – HEALTHCARE

More than 100 healthcare-related datasets from around the world, classified and annotated.

General

Determinants of COVID-19 mortality in the United States dataset (BrainX)

A dataset created to support continuing research into COVID-19. Because it includes information from all 50 states and the District of Columbia, many US statistics can be compared.

Pharmacology

Drug-Induced Liver Injury (DILI) Dataset

The DILIrank dataset is an updated version of the LTKB Benchmark dataset. DILIrank consists of 1,036 FDA-approved drugs that are divided into four classes according to their potential for causing drug-induced liver injury (DILI).

Ophthalmology

SUSTech-SYSU dataset

Dataset for automatically segmenting and classifying corneal ulcers with 712 ocular staining images and the associated segmentation labels for flaky corneal ulcers.

General

Harvard Dataverse

4,000+ healthcare datasets made available by Harvard University. Searchable and diverse.

Pathology

PanNuke Dataset

A semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources.

Imaging

ACR COVID-19 Imaging Dataset

A dataset of images, mainly chest X-rays, from COVID-19 patients.

General

C3.ai COVID-19 Data Lake

Multiple data sources for COVID-19 in a unified data model, ready for analysis in one place.

General

COVID-19 Open Research Dataset Challenge (CORD-19)

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

General

Novel Corona Virus 2019 Dataset

This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus.

The data is available from 22 Jan 2020 onward.

Imaging

The RSNA 2019 Brain CT Hemorrhage Dataset

The largest collection of intracranial hemorrhage CT scans: 874,035 images with expert annotations.

 

Reference: Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge

Cardiology/

General/

Neurology

PHYSIONET (MIMIC/eICU Collaborative)

One of the most comprehensive sources of healthcare datasets, primarily from ICU patients.

https://physionet.org/about/database/

MIMIC-IV Dataset (https://physionet.org/content/mimiciv/0.4/)

Includes:

  • Clinical datasets such as MIMIC, eICU Collaborative, and Pediatric ICU datasets.
  • Waveform datasets with ECG, EEG, and arterial blood pressure waveforms.
  • ECG datasets with various pathophysiologic changes and drug interactions.
  • Fetal datasets including sounds and ECG.
  • Gait and balance datasets, including gait dynamics for patients with various neurodegenerative disorders.
  • Neuro and myoelectric datasets with EEG, EMG, and evoked potential waveforms.
  • Image datasets with chest X-rays and MRI images.
  • Computed tomography images for intracranial hemorrhage detection and segmentation.
  • Miscellaneous datasets with text, language, posture, and other data.

Imaging/

Neurology

ADNI Database

Imaging (MRI), clinical, genomic, and biomarker data from Alzheimer’s disease patients, for the purposes of scientific investigation, teaching, or planning clinical research studies.

http://adni.loni.usc.edu/data-samples/access-data/

Ophthalmology

RIM-ONE

RIM-ONE is a database for optic disc and cup segmentation evaluation from the Medical Image Analysis group.

Critical Care

AmsterdamUMCdb

Contains data on 23,376 intensive care unit and high dependency unit admissions of adult patients at Amsterdam University Medical Center from 2003 to 2016.

 

Pharmacology

FDA Adverse Event Reporting System (FAERS)

The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to the FDA.

Microbiology

Malaria Dataset

A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity. The dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells.

Ophthalmology

RIGA Dataset: Retinal fundus images for glaucoma analysis

A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) derived from three sources, with 750 original images and 4,500 manually marked images.

 

Ophthalmology

High-Resolution Fundus (HRF) Image Database

The public database contains 15 images of healthy patients, 15 images of patients with diabetic retinopathy and 15 images of glaucomatous patients.

Ophthalmology

DR HAGIS: Diabetic Retinopathy, Hypertension, Age-related macular degeneration and Glaucoma ImageS

39 images for development of vessel extraction algorithms suitable for retinal screening programmes.

Cancer

NLST Datasets: National Cancer Institute

Datasets from the National Cancer Institute covering over 54,000 patients. They include data on participant characteristics, screening exam results, diagnostic procedures, lung cancer, and mortality. Images from over 75,000 CT screening exams are available. Over 1,200 pathology images from a subset of NLST lung cancer patients (~500 of over 2,000 patients) may be viewed.

Pulmonary

NSRR Datasets: National Sleep Research Resource

Polysomnography datasets from NSRR for sleep studies. A large collection of de-identified physiologic signals well suited to ML development.

Dermatology

The HAM10000 dataset

A large collection of multi-source dermatoscopic images of common pigmented skin lesions containing 10,000 images.

Related publication: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

General

UCI Machine Learning Repository

This open-source repository has more than 400 datasets, including healthcare (100+) and non-healthcare ones, in a searchable and categorized format.

General

Centers for Medicare and Medicaid Services (CMS) datasets with ResDAC link

CMS datasets provide US Medicare and Medicaid data.

ResDAC (the Research Data Assistance Center) provides free support to users of CMS datasets. Link: https://www.resdac.org/learn

General

Centers for Disease Control and Prevention (CDC) Datasets

Datasets from the Centers for Disease Control and Prevention. Useful for incidence and prevalence of various disorders and for mortality data from across the US.

General

Healthcare Cost and Utilization Project (HCUP) datasets

Agency for Healthcare Research and Quality’s HCUP datasets used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.

General

NHS datasets

The UK government’s National Health Service datasets. NHS Choices datasets are useful for NLP and sentiment analysis, for both GPs and hospitals.

Imaging

OASIS Brain MRI dataset

Brain MRI datasets from the Open Access Series of Imaging Studies (OASIS).

Cancer

National Cancer Institute (NCI)-SEER datasets

Cancer epidemiology data available through NCI’s Surveillance, Epidemiology, and End Results (SEER) Program.

Cancer/

Genetics

BROAD Institute’s Cancer program datasets

Cancer and genomics datasets.

Imaging

MURA

A dataset of 14,000+ anonymized, radiologist-labeled musculoskeletal X-rays from 12,000+ patients, from the Stanford ML Group.

 

Related publication: https://arxiv.org/abs/1712.06957

Imaging

fastMRI

An anonymized dataset of 1,500+ knee MRIs from NYU.

General

NLTK: Natural Language Toolkit

A one-stop resource for learning natural language processing and more.

Related publication: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

General

DAIR.AI

An excellent resource for trends and updates in AI, especially NLP, by Elvis Saravia.

General

Data science article collection

An excellent collection of articles on data science.

General

Google Dataset Search

Google’s powerful search engine to assist with dataset search.

Imaging

NIH CXR14 dataset

Over 100,000 anonymized chest x-ray images and their corresponding data from more than 30,000 patients, including many with advanced lung disease.

Imaging

NIH Deep Lesion

NIH release of a dataset containing 32,000 CT scan images with annotated lesions belonging to 4,400 unique patients.

General

Blue Button 2.0

A CMS initiative to democratize research and development using beneficiary data. A dataset covering more than 70 million patients is available.

General

National Institutes of Health

The link below is for NIH’s strategic plan for data science in healthcare. A must-read for anyone using data in healthcare for research and innovation.

Imaging

NIH Clinical Center

The largest open-source chest X-ray dataset, available through NIH’s Clinical Center. See the link in the article to access the data. Also available through GITHUB and KAGGLE.

General

GITHUB

One of the largest and most advanced software development platforms in the world, with many datasets and repositories.

General

KAGGLE

Kaggle is a great resource for de-identified datasets in healthcare.

General

DataMed

A biomedical data search engine which searches for datasets across registries.

General

Mendeley

A place to store, share, or find data. A platform for biomedical research.

General

Nature

Detailed data repositories for biomedical research, especially proteins and genetics.