
General

NHS-LLM and OpenGPT datasets

3 datasets:

Neurology

AMP®-Parkinson’s Disease Progression Prediction

Data to predict the course of Parkinson’s disease (PD) using protein abundance data. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity. This is a time-series dataset served through Kaggle’s time-series API.

Neurology

Parkinson’s Freezing of Gait Prediction datasets

The data series include three datasets, collected under distinct circumstances:

  • The tDCS FOG (tdcsfog) dataset, comprising data series collected in the lab, as subjects completed a FOG-provoking protocol.
  • The DeFOG (defog) dataset, comprising data series collected in the subject’s home, as subjects completed a FOG-provoking protocol.
  • The Daily Living (daily) dataset, comprising one week of continuous 24/7 recordings from sixty-five subjects. Forty-five subjects exhibit FOG symptoms and also have series in the defog dataset, while the other twenty subjects do not exhibit FOG symptoms and do not have series elsewhere in the data.

General/

Genetics

All of Us Research database

The National Institutes of Health’s All of Us Research Program is building one of the largest biomedical data resources of its kind.

600,000+ participants

350,000+ EHR records

450,000+ biomedical specimen data

 

Cancer/

Imaging

NYUMets datasets

3 metastatic cancer datasets available through an AWS API.

  • Time Series Dataset – Each row in the time series dataset represents a point in time, in units of days indexed from each patient’s initial gamma knife radiosurgery. Dataset variables include clinical details related to medication changes, imaging timing/references to raw imaging files, procedure timing, clinical follow up, and outcomes.
  • Individual Dataset – Each row represents an individual patient with demographic details and summary clinical data.
  • Gamma Knife Details Dataset – Each row represents an individual gamma knife target to provide further details about available gamma knife radiosurgery.
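The day-indexed layout of the time series dataset can be illustrated with a minimal sketch. This is not the NYUMets schema: the event names, the dates, and the helper `days_since_first_radiosurgery` are hypothetical and only show how rows could be indexed in days from a patient's initial gamma knife radiosurgery.

```python
from datetime import date


def days_since_first_radiosurgery(event_date: date, first_gk_date: date) -> int:
    """Return the day index of an event relative to the first gamma knife session."""
    return (event_date - first_gk_date).days


# Hypothetical events for one patient (illustrative values, not real data).
first_gk = date(2020, 3, 1)
events = [
    ("gamma_knife", date(2020, 3, 1)),
    ("follow_up_mri", date(2020, 6, 1)),
    ("medication_change", date(2020, 2, 20)),
]

# One row per point in time, ordered by day index; events that precede
# the initial radiosurgery get a negative index.
rows = sorted(
    (days_since_first_radiosurgery(event_date, first_gk), name)
    for name, event_date in events
)

for day, name in rows:
    print(day, name)
```

Indexing from a per-patient reference date keeps timelines comparable across patients without exposing calendar dates.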

Dermatology

Dermofit Image Library

The Dermofit Image Library is a collection of 1,300 focal high quality skin lesion images collected under standardised conditions with internal colour standards. The lesions span across ten different classes including melanomas, seborrhoeic keratosis and basal cell carcinomas. Each image has a gold standard diagnosis based on expert opinion (including dermatologists and dermatopathologists). Images consist of a snapshot of the lesion surrounded by some normal skin. The Dermofit Image Library is available under an academic licence. There is a one-off £75 licence fee associated with this product.

 

 

Related publication: Rees, Aldridge, Fisher, Ballerini (2013), A Color and Texture Based Hierarchical K-NN Approach to the Classification of Non-melanoma Skin Lesions, Color Medical Image Analysis, Lecture Notes in Computational Vision and Biomechanics 6 (M. E. Celebi, G. Schaefer (eds.))

 

 

Imaging

VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations

A dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, 18,000 images were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases.

Cardiology/

Pediatrics

EchoNet-Pediatric

The EchoNet-Peds database includes 7,643 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes. The database includes patients ranging from 0-18 years (43% female) with a wide range of sizes.

 

 

Related publication: Reddy CD, Lopez L, Ouyang D, Zou JY, He B. Video-Based Deep Learning for Automated Assessment of Left Ventricular Ejection Fraction in Pediatric Patients. J Am Soc Echocardiogr. 2023 Feb 6:S0894-7317(23)00068-8. doi: 10.1016/j.echo.2023.01.015. Epub ahead of print. PMID: 36754100.

Imaging

BraTS (Brain Tumor Segmentation) data

All BraTS multimodal scans are available as NIfTI files (.nii.gz), which were acquired with different clinical protocols and various scanners from multiple (n=19) institutions. The overall survival (OS) data, defined in days, are included in a comma-separated value (.csv) file with correspondences to the pseudo-identifiers of the imaging data. The .csv file also includes the age of patients, as well as the resection status.
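A minimal sketch of how the survival .csv could be linked to the imaging data through the pseudo-identifiers. The column names, the sample rows, and the file-naming pattern in `scan_path` are assumptions made for illustration; check the actual release for the real headers and directory layout.

```python
import csv
import io

# Hypothetical CSV contents; the real column names and values may differ.
SAMPLE_CSV = """subject_id,age,survival_days,resection_status
case_001,60.5,289,GTR
case_002,52.3,616,STR
"""


def load_survival(csv_text: str) -> dict:
    """Map each pseudo-identifier to age, overall survival in days, and resection status."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row["subject_id"]] = {
            "age": float(row["age"]),
            "survival_days": int(row["survival_days"]),
            "resection_status": row["resection_status"],
        }
    return table


def scan_path(subject_id: str, modality: str) -> str:
    """Build a hypothetical NIfTI path for one modality of one subject."""
    return f"{subject_id}/{subject_id}_{modality}.nii.gz"


survival = load_survival(SAMPLE_CSV)
for subject_id, info in survival.items():
    # Pair each survival record with the (assumed) location of its T1 scan.
    print(scan_path(subject_id, "t1"), info["survival_days"])
```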

 

Related publication: B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, et al. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”, IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694

Pathology

CAMELYON data sets: WSI images

The data in this challenge contains whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections. Depending on the particular data set (see below), ground truth is provided:

  • On a lesion-level: with detailed annotations of metastases in WSI.
  • On a patient-level: with a pN-stage label per patient.

All ground truth annotations were carefully prepared under supervision of expert pathologists. WSI are provided as TIFF images. Lesion-level annotations are provided as XML files. For training, 100 patients are provided, with another 100 patients for testing. The test data set contains 500 slides. In total there are 1,000 slides, with 5 slides per patient.
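Reading lesion-level polygons from an annotation XML file might look like the sketch below. The element and attribute names (`Annotation`, `Coordinate`, `Order`, `X`, `Y`) follow the ASAP-style layout commonly associated with these challenges, but treat them as an assumption and verify against the files you download; the sample XML is made up.

```python
import xml.etree.ElementTree as ET

# Made-up annotation in an assumed ASAP-style layout.
SAMPLE_XML = """<ASAP_Annotations>
  <Annotations>
    <Annotation Name="lesion_0" Type="Polygon">
      <Coordinates>
        <Coordinate Order="0" X="0" Y="0" />
        <Coordinate Order="1" X="4" Y="0" />
        <Coordinate Order="2" X="4" Y="3" />
        <Coordinate Order="3" X="0" Y="3" />
      </Coordinates>
    </Annotation>
  </Annotations>
</ASAP_Annotations>"""


def read_polygons(xml_text: str) -> dict:
    """Return {annotation name: [(x, y), ...]} with vertices sorted by their Order attribute."""
    root = ET.fromstring(xml_text)
    polygons = {}
    for annotation in root.iter("Annotation"):
        coords = sorted(
            annotation.iter("Coordinate"), key=lambda c: int(c.get("Order"))
        )
        polygons[annotation.get("Name")] = [
            (float(c.get("X")), float(c.get("Y"))) for c in coords
        ]
    return polygons


def polygon_area(points: list) -> float:
    """Area of a simple polygon via the shoelace formula (in squared pixel units)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


polygons = read_polygons(SAMPLE_XML)
for name, points in polygons.items():
    print(name, len(points), polygon_area(points))
```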

 

Imaging

Chest X-rays (Indiana University)

The dataset contains 7,471 chest X-ray images in .png file format and 3,955 patients’ radiology text reports available in .XML format. Each image has been paired with four captions such as Impressions, Findings, Comparison and Indication that provide clear descriptions of the salient entities and events.

Original data source: https://openi.nlm.nih.gov/

 

General/

Imaging

Open-i: National Library of Medicine

Open-i provides access to over 3.7 million images from about 1.2 million PubMed Central® articles; 7,470 chest x-rays with 3,955 radiology reports; 67,517 images from NLM History of Medicine collection; and 2,064 orthopedic illustrations.

Imaging

Brain tissue segmentation MRI dataset

A synthetic dataset of brain images simulated across 42 different MR protocols and based on 500 different reference brains from the Human Connectome Project (HCP) (Van Essen et al., 2012), leading to 21,000 simulated brain images.

Related Publication: You S, Reyes M. Influence of contrast and texture based image modifications on the performance and attention shift of U-Net models for brain tissue segmentation. Frontiers in Neuroimaging. 2022;1.

Imaging

The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset

An open-source data collection consisting of a total of 955 T1-weighted MRIs (Magnetic Resonance Imaging) with manually segmented diverse lesions and metadata.

Related publication: Liew, Sook-Lei. The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset – Release 2.0, 2021. Inter-university Consortium for Political and Social Research [distributor], 2022-08-08. https://doi.org/10.3886/ICPSR36684.v5

Cancer/

Imaging

Breast Cancer MRI Dataset: Duke

The dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, having the following data components:

  1. Demographic, clinical, pathology, treatment, outcomes, and genomic data: Collected from a variety of sources including clinical notes, radiology reports, and pathology reports.
  2. Pre-operative dynamic contrast enhanced (DCE)-MRI: Downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release in DICOM format.
  3. Locations of lesions in DCE-MRI: Annotations on the DCE-MRI images by radiologists.
  4. Imaging features from DCE-MRI: A set of 529 imaging features computer-extracted by in-house software.

Related publication: Saha, A., Harowicz, M.R., Grimm, L.J., Kim, C.E., Ghate, S.V., Walsh, R. and Mazurowski, M.A., 2018. A machine learning approach to radiogenomics of breast cancer.

General

National Health and Nutrition Examination Survey (NHANES) Data

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The survey examines a nationally representative sample of about 5,000 persons each year. Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases.

General

Protective Policy Index (PPI) global dataset for COVID-19

This is an original dataset of stringency of public health policy measures that were adopted in response to COVID-19 worldwide by governments at national and sub-national levels. The data set covers governments’ policy responses between January 24, 2020 and December 31, 2020.

Related publication: Shvetsova, O., Zhirnov, A., Adeel, A.B. et al. Protective Policy Index (PPI) global dataset of origins and stringency of COVID 19 mitigation policies. Sci Data 9, 319 (2022). https://doi.org/10.1038/s41597-022-01437-9

Cardiology/

General/

Pathology

Nightingale Open Science Datasets

Multiple datasets available:

  1. silent-cchs-ecg: Diagnosing ‘silent’ heart attack (48,000 ECG waveforms)
  2. brca-psj-path: Identifying high-risk breast cancer (175,000 biopsy slides)
  3. arrest-ntuh-ecg: Subtyping cardiac arrest (24,106 ECG waveforms)
  4. fracture-aimi-xray: Predicting fractures (64,000 chest x-rays)
  5. covid-psj-xray: Emergency triage of Covid-19 patients (7,500 chest x-rays)

General/

Pulmonary

COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening

A dataset consisting of 53,449 audio samples (over 552 hours in total) crowd-sourced from 36,116 participants through our COVID-19 Sounds app. It also provides participants’ self-reported COVID-19 testing status with 2,106 samples tested positive.

 

Related publication: COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening

Imaging

RadGraph: Extracting Clinical Entities and Relations from Radiology Reports

This dataset contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Additionally, there is an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs.

 

Related publication: Jain, S., Agrawal, A., Saporta, A., Truong, S. Q., Nguyen Duong, D., Bui, T., Chambon, P., Lungren, M., Ng, A., Langlotz, C., & Rajpurkar, P. (2021). RadGraph: Extracting Clinical Entities and Relations from Radiology Reports (version 1.0.0). PhysioNet. https://doi.org/10.13026/hm87-5p47.

 

General

Papers with code medical datasets

200+ datasets of various types with links and papers. Includes search options for datatypes, language and more.

Dermatology

PH² – a dermoscopic image database

The PH² database includes the manual segmentation, the clinical diagnosis, and the identification of several dermoscopic structures, performed by expert dermatologists, in a set of 200 dermoscopic images.

 

Related publication: Mendonca T, Ferreira PM, Marques JS, Marcal AR, Rozeira J. PH² – a dermoscopic image database for research and benchmarking. Annu Int Conf IEEE Eng Med Biol Soc. 2013;2013:5437-40. doi: 10.1109/EMBC.2013.6610779. PMID: 24110966

General

VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection

Vision-based Fallen Person (VFP290K) dataset consists of 294,713 frames of fallen persons extracted from 178 videos, including 131 scenes in 49 locations. It demonstrated the effectiveness of the features through extensive experiments analyzing the performance shift based on object detection models.

 

Related publication: VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection

Critical Care

HiRID, a high time-resolution ICU dataset

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand adult patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland (ICU), an interdisciplinary 60-bed unit admitting >6,500 patients per year. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters from almost 34 thousand admissions during the period from January 2008 to June 2016. Data is stored with a uniquely high time resolution of one entry every two minutes.

 

Related publication: Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., & Merz, T. (2021). HiRID, a high time-resolution ICU dataset (version 1.1.1). PhysioNet. https://doi.org/10.13026/nkwc-js72.

Critical Care

The eICU Collaborative Research Database

The eICU Collaborative Research Database is a multi-center intensive care unit (ICU) database with high-granularity data for over 200,000 admissions to ICUs monitored by eICU Programs across the United States.

 

Related publication: The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI: http://dx.doi.org/10.1038/sdata.2018.178.

Critical Care

MIMIC-IV

The Medical Information Mart for Intensive Care (MIMIC)-IV database provides critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC).

 

Related publication: Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2020). MIMIC-IV (version 0.4). PhysioNet. https://doi.org/10.13026/a3wn-hq05.

General/

Neurology/

Ophthalmology

EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction

A dataset of paired Electroencephalography (EEG) and video-infrared eye tracking (ET) recordings from 356 subjects for more than 47 hours in total. A benchmark consisting of 3 evaluation tasks with increasing difficulty is introduced alongside the dataset.

 

Related publication: EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction

Anesthesiology/

General/

Neurology

Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management

Q-Pain, a dataset for assessing bias in medical QA in the context of pain management. 55 medical question-answer pairs across five different types of pain management: each question includes a detailed patient-specific medical scenario (“vignette”) designed to enable the substitution of multiple different racial and gender “profiles” and to evaluate whether bias is present when answering whether or not to prescribe medication.
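The profile-substitution idea can be sketched as follows. The vignette wording, the placeholder names, and the profile lists are hypothetical and are not taken from Q-Pain itself; the point is only that one clinical scenario is expanded into otherwise identical variants that differ in the race and gender profile, so that answers can be compared across variants.

```python
from itertools import product
from string import Template

# Hypothetical vignette; only the profile placeholders change between variants.
VIGNETTE = Template(
    "Patient profile: $race $gender. "
    "The patient reports persistent lower back pain after surgery. "
    "Question: should pain medication be prescribed?"
)

# Hypothetical profile values used for substitution.
RACES = ["Asian", "Black", "Hispanic", "White"]
GENDERS = ["man", "woman"]


def expand_vignette(template: Template, races: list, genders: list) -> list:
    """Return one filled-in vignette per (race, gender) combination."""
    return [
        {
            "race": race,
            "gender": gender,
            "text": template.substitute(race=race, gender=gender),
        }
        for race, gender in product(races, genders)
    ]


variants = expand_vignette(VIGNETTE, RACES, GENDERS)
print(len(variants))
print(variants[0]["text"])
```

Because the clinical content is held fixed, any systematic difference in a model's answers across the variants can be attributed to the substituted profile.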

 

Related publication: Logé, C., Ross, E., Dadey, D. Y. A., Jain, S., Saporta, A., Ng, A., & Rajpurkar, P. (2021). Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management (version 1.0.0). PhysioNet. https://doi.org/10.13026/2tdv-hj07.

Imaging

Chest ImaGenome Dataset

Dataset contributes significantly to the research community by providing 1) 1,256 combinations of relation annotations between 29 CXR anatomical locations (objects with bounding box coordinates) and their attributes, structured as a scene graph per image, 2) over 670,000 localized comparison relations (for improved, worsened, or no change) between the anatomical locations across sequential exams, as well as 3) a manually annotated gold standard scene graph dataset from 500 unique patients.

 

Related publication: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/wv01-y230.

General

Therapeutics Data Commons (TDC)

TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and diverse types of data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards.

 

Related publication: Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Imaging

Report-Annotated Duke Chest CT (RAD-ChestCT)

The RAD-ChestCT dataset is an imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset.

 

 

Related publication: Draelos et al., “Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes,” Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857

Dermatology

MED-NODE

The dataset consists of 70 melanoma and 100 naevus images from the digital image archive of the Department of Dermatology of the University Medical Center Groningen (UMCG) used for the development and testing of the MED-NODE system for skin cancer detection from macroscopic images. The file contains 170 images (70 melanoma and 100 nevi cases).

 

Related publications: I. Giotis, N. Molders, S. Land, M. Biehl, M.F. Jonkman and N. Petkov: “MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images”, Expert Systems with Applications, 42 (2015), 6578-6585

General

BigBIO: Biomedical NLP datasets

BIGBIO is a community library of 126+ biomedical NLP datasets currently covering 12 task categories and 10+ languages, with programmatic access. BIGBIO enables reproducible data-centric machine learning workflows by focusing on programmatic access to datasets and their metadata in a uniform format.

 

Related Publication: BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Dermatology

PAD-UFES-20: a skin lesion dataset collected from smartphones

The dataset consists of 2,298 samples of six different types of skin lesions. Each sample consists of a clinical image and up to 22 clinical features including the patient’s age, skin lesion location, Fitzpatrick skin type, and skin lesion diameter. In total, there are 1,373 patients, 1,641 skin lesions, and 2,298 images present in the dataset. All BCC, SCC, and MEL are biopsy-proven. The remaining ones may have clinical diagnosis according to a consensus of a group of dermatologists. In total, approximately 58% of the samples in this dataset are biopsy-proven.

 

Related publication: PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones

General/

Microbiology

International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC) COVID-19 dataset

The database includes data from more than 705,000 patients, collected in more than 60 countries and 1,500 centres worldwide. Patient data are available from acute hospital admissions with COVID-19 and outpatient follow-ups. The data include signs and symptoms, pre-existing comorbidities, vital signs, chronic and acute treatments, complications, dates of hospitalization and discharge, mortality, viral strains, vaccination status, and other data.

 

 

Related publication: ISARIC-COVID-19 dataset: A Prospective, Standardized, Global Dataset of Patients Hospitalized with COVID-19

Dermatology

SNU dataset

2,201 images with diagnoses based on biopsy or clinical impression. 174 disease classes are available for model training.

 

 

General

BioRED: a rich biomedical relation extraction dataset

Biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of 600 PubMed abstracts.

 

 

Related publication: Ling Luo, et al. BioRED: a rich biomedical relation extraction dataset, Briefings in Bioinformatics, 2022

Imaging

RadImageNet

The RadImageNet database includes 1.35 million annotated CT, MRI, and ultrasound images of musculoskeletal, neurologic, oncologic, gastrointestinal, endocrine, and pulmonary pathology. The RadImageNet database contains medical images of 3 modalities, 11 anatomies, and 165 pathologic labels.

Imaging

BRAX, a Brazilian labeled chest X-ray dataset

BRAX dataset provides 40,967 images, 24,959 imaging studies for 19,351 patients presenting to the Hospital Israelita Albert Einstein. All images have been verified by trained radiologists and de-identified to protect patient privacy. Fourteen labels were derived from free-text radiology reports written in Brazilian Portuguese using Natural Language Processing.

 

 

Related publication: BRAX, a Brazilian labeled chest X-ray dataset

Imaging

MONAI: Medical Open Network for Artificial Intelligence

The MONAI framework is the open-source foundation being created by Project MONAI. MONAI is a freely available, community-supported, PyTorch-based framework for deep learning in healthcare imaging. Project MONAI also includes MONAI Label, an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets, and build AI models in a standardized MONAI paradigm.

Imaging

UPENN-GBM: MRI scans for Glioblastoma (GBM) patients

This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format.

630 patients, 3,301 studies, 820,000+ images.

General/

Imaging/

Pathology/

Surgery

Grand Challenge: Image analysis datasets and algorithms

A platform for end-to-end development of machine learning solutions in biomedical imaging. Grand Challenge was developed in 2010 to make it easy for organizers of challenges to set up a website for a particular challenge and to make all information on challenges in the domain of biomedical image analysis available in one place. This system has been operational since 2017 and has been used by over 300 challenges and 70,000 users, with more than 1,000 algorithms.

Dermatology

Seven-Point Checklist Dermatology Dataset

A database for evaluating computerized image-based prediction of the 7-point skin lesion malignancy checklist. The dataset includes over 2000 clinical and dermoscopy color images, along with corresponding structured metadata tailored for training and evaluating computer aided diagnosis (CAD) systems.

 

Related publication: J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “Seven-Point Checklist and Skin Lesion Classification using Multitask Multimodal Neural Nets,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 2, pp. 538–546, 2019.

Imaging/

Neurology

OpenNeuro Datasets

A free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data. 720 public datasets and growing.

 

 

Webpage: https://openneuro.org/

Cardiology

EchoNet-LVH

The EchoNet-LVH dataset includes 12,000 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac chamber size and wall thickness.

 

 

Related publication: High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy with Cardiovascular Deep Learning

Imaging

Japanese Society of Radiological Technology (JSRT) database

The database includes 154 conventional chest radiographs with a lung nodule (100 malignant and 54 benign nodules) and 93 radiographs without a nodule. The database also includes additional information such as patient age, gender, diagnosis (malignant or benign), X and Y coordinates of the nodule, and a simple diagram of the nodule location. Lung nodule images were classified into five groups according to the degrees of subtlety.

 

Related publication: Shiraishi J, Katsuragawa S, Ikezoe J, et al: Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. AJR 174:71-74, 2000.

Anesthesiology

Behavioral and autonomic dynamics during propofol-induced unconsciousness dataset

Data was collected from nine healthy volunteers during a study of propofol-induced unconsciousness. For all subjects, approximately 3 hours of data were recorded while using a target-controlled infusion protocol. Data includes continuous electrocardiogram (ECG); interventions included in the study for patient safety, such as administering phenylephrine (a vasopressor); heart rate variability (HRV); and electrodermal activity (EDA).

 

Related publication: Subramanian, S., Purdon, P., Barbieri, R., & Brown, E. (2021). Behavioral and autonomic dynamics during propofol-induced unconsciousness (version 1.0). PhysioNet. https://doi.org/10.13026/2rbc-1r03.

Ophthalmology

A global review of publicly available datasets for ophthalmological imaging

94 open access ophthalmological imaging datasets containing 507 724 images and 125 videos from 122 364 patients.

Cardiology

PTB-XL: EKG dataset

The PTB-XL ECG dataset is a large dataset of 21,837 clinical 12-lead ECGs of 10-second length from 18,885 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. In total, 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements.

 

Related publication: Wagner, P., Strodthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze, F.I., Samek, W., Schaeffter, T. (2020), PTB-XL: A Large Publicly Available ECG Dataset. Scientific Data. https://doi.org/10.1038/s41597-020-0495-6

Cardiology/

Dermatology/

General/

Imaging

Stanford AIMI Shared Datasets

A collection of de-identified annotated medical imaging data to foster transparent and reproducible collaborative research. X-rays, CT scans, MRIs, Echocardiography and Dermatology images.

Dermatology

DDI – Diverse Dermatology Images: Stanford AIMI Dataset

Diverse Dermatology Images (DDI) dataset—the first publicly available, deeply curated, and pathologically confirmed image dataset with diverse skin tones. The DDI was retrospectively selected from reviewing pathology reports in Stanford Clinics from 2010-2020. It has a total of 656 images representing 570 unique patients.

General

Huggingface datasets

Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Currently, over 2,658 datasets and more than 34 metrics are available. A search for the term “medical” returns at least 13 datasets. Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model.
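A minimal sketch of the single-line load, wrapped in a small helper. It assumes the `datasets` package is installed and that network access to the Hub is available; the helper name `load_split` and the dataset identifier in the commented example are placeholders, not recommendations of a specific medical dataset.

```python
def load_split(dataset_name: str, split: str = "train"):
    """Load one split of a dataset from the Hugging Face Hub.

    Requires the `datasets` package and network access. The import is kept
    inside the function so this module can be loaded without the package.
    """
    from datasets import load_dataset

    # The single-line load described above.
    return load_dataset(dataset_name, split=split)


# Example usage (not executed here; the dataset name is a placeholder):
# ds = load_split("some-org/some-medical-dataset")
# print(ds[0])
```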

Pulmonary

DCSM Sleep Staging Dataset

The DCSM dataset consists of 255 randomly selected and fully anonymized overnight lab-based PSG recordings from patients visiting the DCSM for the diagnosis of non-specific sleep related disorders. The DCSM dataset represents a diverse cohort of Danish patients with respect to demographic characteristics, diagnostic background and sleep/non-sleep related medication usage, collected between 2015-2018.

 

Pulmonary

Dreem Open Datasets

Two publicly-available datasets, DOD-H including 25 healthy volunteers and DOD-O including 55 patients suffering from obstructive sleep apnea (OSA). Both datasets have been scored by 5 sleep technologists from different sleep centers. We developed a framework to compare automated approaches to a consensus of multiple human scorers.
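Comparing an automated stager against a consensus of several human scorers can be sketched as below. The sketch uses a plain per-epoch majority vote, which is an assumption and may differ from the consensus procedure in the related publication; the stage labels and scorings are made up.

```python
from collections import Counter


def majority_consensus(scorings: list) -> list:
    """Per-epoch majority vote across scorers.

    `scorings` holds one list of stage labels per scorer, all of equal length.
    On a tie, the tied label that appears first in scorer order is kept.
    """
    n_epochs = len(scorings[0])
    if any(len(s) != n_epochs for s in scorings):
        raise ValueError("all scorers must label the same number of epochs")
    return [
        Counter(epoch_votes).most_common(1)[0][0]
        for epoch_votes in zip(*scorings)
    ]


def agreement(predicted: list, consensus: list) -> float:
    """Fraction of epochs where the prediction matches the consensus."""
    matches = sum(p == c for p, c in zip(predicted, consensus))
    return matches / len(consensus)


# Made-up hypnograms from five scorers over five epochs.
scorers = [
    ["W", "N1", "N2", "N2", "REM"],
    ["W", "N2", "N2", "N3", "REM"],
    ["W", "N1", "N2", "N3", "REM"],
    ["W", "N1", "N2", "N2", "W"],
    ["N1", "N1", "N2", "N3", "REM"],
]
consensus = majority_consensus(scorers)

# Made-up output of an automated approach for the same epochs.
automated = ["W", "N1", "N2", "N3", "W"]
print(consensus, agreement(automated, consensus))
```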

 

Related publication: A. Guillot, F. Sauvet, E. H. During and V. Thorey, “Dreem Open Datasets: Multi-Scored Sleep Datasets to Compare Human and Automated Sleep Staging,” in IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 9, pp. 1955-1965, Sept. 2020, doi: 10.1109/TNSRE.2020.3011181.

Cardiology/

Imaging

Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (M&Ms) Dataset

375 heterogeneous cardiac magnetic resonance (CMR) datasets acquired by using four different scanner vendors in six hospitals and three different countries (Spain, Canada and Germany).

 

Related publication: V. M. Campello et al., “Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge,” in IEEE Transactions on Medical Imaging, vol. 40, no. 12, pp. 3543-3554, Dec. 2021, doi: 10.1109/TMI.2021.3090082.

Cancer/

Genetics/

Imaging

The Cancer Imaging Archive (TCIA) dataset collection

TCIA data are organized as “collections”; typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.) or research focus. Supporting data related to the images such as patient outcomes, treatment details, genomics and image analyses are also provided when available. 100+ datasets, many of which are public.

General

n2c2 NLP Research Data Sets

Unstructured notes from the Research Patient Data Registry at Partners Healthcare, Boston, USA (originally developed during the i2b2 project). Clinical Natural Language Processing (NLP) data sets were originally created at a former NIH-funded National Center for Biomedical Computing (NCBC) known as i2b2: Informatics for Integrating Biology and the Bedside. Beginning in 2018, they are officially known as n2c2 (National NLP Clinical Challenges).

General

emrQA dataset

emrQA is a publicly available, large-scale EMR Question Answering (QA) corpus created with a novel semi-automated generation framework that allows for minimal expert involvement and re-purposes existing annotations available for other clinical NLP tasks. emrQA has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. The dataset uses existing NLP task annotations from the i2b2 Challenge datasets.

 

 

Related publication: Pampari, A., Raghavan, P., Liang, J.J., & Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP.

Anesthesiology

VSCapture: An open source tool for Data acquisition from anesthesia monitor

VSCapture is an open source tool developed in the C# programming language on the .NET/Mono platform, which allows it to run on Windows, Macintosh OS X, and Ubuntu Linux operating systems.

 

Related publication: Data acquisition from S/5 GE Datex anesthesia monitor using VSCapture.

Related dataset: The University of Queensland Vital Signs Dataset.

The University of Queensland Vital Signs Dataset contains a wide range of patient monitoring data and vital signs that were recorded during 32 surgical cases where patients underwent anaesthesia at the Royal Adelaide Hospital.

Cancer/

Pathology

Prostate cANcer graDe Assessment (PANDA) Challenge dataset

12,625 whole-slide images (WSIs) of prostate biopsies were available for model development (the development set), 393 for performance evaluation during the competition phase (the tuning set), 545 as the internal validation set in the postcompetition phase and 1,071 for external validation from 6 different sites.

 

Related publication: Bulten, W., Kartasalo, K., Chen, PH.C. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med (2022). https://doi.org/10.1038/s41591-021-01620-2

Cardiology/

General

Hero DMC Heart Institute (HDHI): Hospital admissions dataset

This dataset comes from the cardiology unit of a tertiary care medical college and hospital in India, and covers 14,845 admissions corresponding to 12,238 patients.

 

Related publication: Bollepalli, S.C.; Sahani, A.K.; Armoundas, A.A., et al. An Optimized Machine Learning Model Accurately Predicts In-Hospital Outcomes at Admission to a Cardiac Unit. Diagnostics 2022, 12, 241. https://doi.org/10.3390/diagnostics12020241

 

Dermatology

International Skin Imaging Collaboration (ISIC) Dataset

The dataset includes over 69,000 dermatology images. The International Skin Imaging Collaboration (ISIC) is a global partnership that has organized the world’s largest repository of publicly available dermoscopic images and hosted the first public benchmarks for melanoma detection in dermoscopic images, titled “Skin Lesion Analysis Towards Melanoma Detection”, at the IEEE International Symposium on Biomedical Imaging (ISBI).

Imaging

CQ500 dataset

A dataset of 491 head CT scans with 193,317 slices, including anonymized DICOMs for all the scans and the corresponding reads by three radiologists with 8, 12, and 20 years of experience in cranial CT interpretation, respectively.

 

Related publication: Development and Validation of Deep Learning Algorithms for Detection of Critical Findings in Head CT Scans.

Critical Care/

Imaging

COVID-Net

A publicly available suite of tailored deep neural network models for tackling challenges ranging from screening to risk stratification to treatment planning for patients infected with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

  • Chest x-rays: 16,352 CXR images across 14,979 patients
  • Chest CT: 201,103 CT slices from 4,501 patients
  • Chest point-of-care ultrasound: 29,651 POCUS images
  • COVID-Net ICU: 1,925 records from 385 patients

The project has also expanded to the open source TB-Net initiative for tuberculosis screening, the Fibrosis-Net initiative for pulmonary fibrosis progression prediction, and the Cancer-Net initiative for cancer screening.

Emergency Department

MIMIC-IV-ED

MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2016. It contains 448,972 ED stays with vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses.

Imaging

RICORD: RSNA International COVID-19 Open Annotated Radiology Database

This database is the first multi-institutional, multi-national, expert-annotated COVID-19 imaging dataset. Annotated by three radiologists with majority vote adjudication by board-certified radiologists, RICORD consists of 240 thoracic CT scans and 1,000 chest radiographs contributed from four international sites.

Anesthesiology

VitalDB dataset

A comprehensive dataset of 6,388 surgical patients composed of intraoperative biosignals and clinical information from the Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul, Korea.

Pathology

NuCLS

The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from The Cancer Genome Atlas (TCGA). These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students.

Imaging

CheXpert

CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients from Stanford Hospital.

Cancer/

Genetics

Genomic Data Commons (GDC) datasets

The GDC Portal is a platform from the National Cancer Institute (NCI) with cancer-related genomic data for 80,000+ cases.

Imaging

BIMCV-COVID19 Imaging Datasets

The BIMCV-COVID19+ dataset is a large dataset with chest X-ray images and computed tomography (CT) imaging of COVID-19 patients along with their radiographic findings, pathologies, polymerase chain reaction (PCR) and immunoglobulin antibody test results, and radiographic reports from the Medical Imaging Databank in Valencian Region Medical Image Bank (BIMCV). These iterations of the database include 7,377 CR, 9,463 DX, and 6,687 CT studies.

Imaging

VinBigData Chest X-ray abnormalities detection

Provided on Kaggle by the Vingroup Big Data Institute (VinBigData), which aims to promote fundamental research and investigate novel and highly applicable technologies. The dataset consists of 18,000 images that have been annotated by experienced radiologists.

Cardiology

EchoNet-Dynamic

The EchoNet-Dynamic database includes 10,030 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes.

 

Related publication: Video-based AI for beat-to-beat assessment of cardiac function

Genetics/

Pharmacology

PGxCorpus: a Manually Annotated Corpus for Pharmacogenomics

941 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs, and phenotypes) and the relationships between them.

General

CENTAUR LABS

A list of open source healthcare datasets classified into 40+ specialities, with direct links to the datasets and more information.

General

DATA WORLD – HEALTHCARE

More than 100 healthcare-related datasets from around the world, classified and annotated.

General

Determinants of COVID-19 mortality in the United States dataset (BrainX)

A dataset created for the purpose of continuing research into COVID-19. Because it includes information from all 50 states and the District of Columbia, many US statistics can also be compared.

Pharmacology

Drug-Induced Liver Injury (DILI) Dataset

The DILIrank dataset is an updated version of the LTKB Benchmark dataset. DILIrank consists of 1,036 FDA-approved drugs that are divided into four classes according to their potential for causing drug-induced liver injury (DILI).

Ophthalmology

SUSTech-SYSU dataset

A dataset for automatically segmenting and classifying corneal ulcers, with 712 ocular staining images and the associated segmentation labels for flaky corneal ulcers.

General

Harvard Dataverse

4,000+ healthcare datasets made available by Harvard University. Searchable and diverse.

Pathology

PanNuke Dataset

A semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources.

Imaging

ACR COVID-19 Imaging Dataset

A dataset of images, mainly chest X-rays, from COVID-19 patients.

General

C3.ai COVID-19 Data Lake

Multiple data sources for COVID-19 in a unified data model, ready for analysis in one place.

General

COVID-19 Open Research Dataset Challenge (CORD-19)

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

General

Novel Corona Virus 2019 Dataset

This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. Data are available from 22 Jan 2020 onward.

Imaging

The RSNA 2019 Brain CT Hemorrhage Dataset

The largest collection of intracranial hemorrhage CT scans: 874,035 images with expert annotations.

 

Reference: Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge

Cardiology/

General/

Neurology

PHYSIONET (MIMIC/eICU Collaborative)

One of the most comprehensive sources of healthcare datasets, primarily from ICU patients.

https://physionet.org/about/database/

MIMIC – IV Dataset (https://physionet.org/content/mimiciv/0.4/)

Includes:

  • Clinical datasets such as MIMIC, eICU Collaborative, and Pediatric ICU datasets.
  • Waveform datasets with ECG, EEG, and arterial blood pressure waveforms.
  • ECG datasets with various pathophysiologic changes and drug interactions.
  • Fetal datasets including sounds and ECG.
  • Gait and Balance datasets including gait dynamics for patients with various neurodegenerative disorders.
  • Neuro and Myoelectric datasets with EEG, EMG, and evoked potential waveforms.
  • Image datasets with chest X-rays and MRI images.
  • Computed tomography images for intracranial hemorrhage detection and segmentation.
  • Miscellaneous datasets with text, language, posture, and other data.

Imaging/

Neurology

ADNI Database

Imaging (MRI), clinical, genomic, and biomarker data from Alzheimer’s disease patients, for the purposes of scientific investigation, teaching, or planning clinical research studies.

http://adni.loni.usc.edu/data-samples/access-data/

Ophthalmology

RIM-ONE

RIM-ONE is a database for optic disc and cup segmentation evaluation by the Medical Image Analysis Group.

Critical Care

AmsterdamUMCdb

Contains data related to 23,376 intensive care unit and high dependency unit admissions of adult patients at Amsterdam University Medical Center from 2003 to 2016.

 

Pharmacology

FDA Adverse Event Reporting System (FAERS)

The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to the FDA.

Microbiology

Malaria Dataset

A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity. The dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells.

Ophthalmology

RIGA Dataset: Retinal fundus images for glaucoma analysis

A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) derived from three sources, with 750 original images and 4,500 manually marked images.

 

Ophthalmology

High-Resolution Fundus (HRF) Image Database

The public database contains 15 images of healthy patients, 15 images of patients with diabetic retinopathy and 15 images of glaucomatous patients.

Ophthalmology

DR HAGIS: Diabetic Retinopathy, Hypertension, Age-related macular degeneration and Glaucoma ImageS

39 images for development of vessel extraction algorithms suitable for retinal screening programmes.

Cancer

NLST Datasets: National Cancer Institute

Datasets from the National Cancer Institute covering over 54,000 patients. They include data on participant characteristics, screening exam results, diagnostic procedures, lung cancer, and mortality. Images from over 75,000 CT screening exams are available. Over 1,200 pathology images from a subset of NLST lung cancer patients (~500 of over 2,000 patients) may be viewed.

Pulmonary

NSRR Datasets: National Sleep Research Resource

Polysomnography datasets from NSRR for sleep studies. A large collection of de-identified physiologic signals well suited to ML development.

Dermatology

The HAM10000 dataset

A large collection of multi-source dermatoscopic images of common pigmented skin lesions containing 10,000 images.

Related publication: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

General

UCI Machine Learning Repository

This open source repository has more than 400 datasets, including 100+ healthcare datasets as well as non-healthcare ones, in a searchable and categorized format.

General

Centers for Medicare and Medicaid Services (CMS) datasets with ResDAC link

CMS datasets provide US Medicare and Medicaid data.

ResDAC (the Research Data Assistance Center) provides free support to users of CMS datasets. Link: https://www.resdac.org/learn

General

Centers for Disease Control and Prevention (CDC) Datasets

Datasets from the Centers for Disease Control and Prevention. Useful for the incidence and prevalence of various disorders, and for mortality data from across the US.

General

Healthcare Cost and Utilization Project (HCUP) datasets

The Agency for Healthcare Research and Quality’s HCUP datasets are used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.

General

NHS datasets

The UK government’s National Health Service (NHS) datasets. NHS Choices datasets are useful for NLP and sentiment analysis, for both GPs and hospitals.

Imaging

OASIS Brain MRI dataset

Brain MRI datasets from the Open Access Series of Imaging Studies (OASIS).

Cancer

National Cancer Institute (NCI) SEER datasets

Cancer epidemiology data available through NCI’s Surveillance, Epidemiology, and End Results (SEER) Program.

Cancer/

Genetics

Broad Institute’s Cancer Program datasets

Cancer and genomics datasets.

Imaging

MURA

A dataset of 14,000+ anonymized, radiologist-labeled musculoskeletal X-rays from 12,000+ patients, released by the Stanford ML Group.

 

Related publication: https://arxiv.org/abs/1712.06957

Imaging

fastMRI

An anonymized dataset of 1,500+ knee MRIs from NYU.

General

NLTK: Natural Language Toolkit

A one-stop resource for learning natural language processing and more.

Related publication: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

General

DAIR.AI

An excellent resource for trends and updates in AI, especially NLP, by Elvis Saravia.

General

Data science article collection

An excellent collection of articles on data science.

General

Google Dataset Search

Google’s powerful search engine to assist with dataset search.

Imaging

NIH CXR14 dataset

Over 100,000 anonymized chest x-ray images and their corresponding data from more than 30,000 patients, including many with advanced lung disease.

Imaging

NIH Deep Lesion

An NIH release of a dataset containing 32,000 CT scan images with annotated lesions, belonging to 4,400 unique patients.

General

Blue Button 2.0

A CMS initiative to democratize research and development using beneficiary data. A dataset covering more than 70 million patients is available.

General

National Institutes of Health

The link below is for NIH’s strategic plan for data science in healthcare. A must-read for anyone using data in healthcare for research and innovation.

Imaging

NIH Clinical Center

The largest open source chest X-ray dataset, available through NIH’s Clinical Center. See the link in the article to access the data. Also available through GitHub and Kaggle.

General

GITHUB

One of the largest and most advanced software development platforms in the world, with many datasets and repositories.

General

KAGGLE

Kaggle is a great resource for de-identified datasets in healthcare.

General

DataMed

A biomedical data search engine which searches for datasets across registries.

General

Mendeley

A place to store, share, or find data. A platform for biomedical research.

General

Nature

Detailed data repositories for biomedical research, especially proteins and genetics.