
Imaging

Japanese Society of Radiological Technology (JSRT) database

The database includes 154 conventional chest radiographs with a lung nodule (100 malignant and 54 benign nodules) and 93 radiographs without a nodule. It also includes additional information such as patient age, gender, diagnosis (malignant or benign), X and Y coordinates of the nodule, and a simple diagram of the nodule location. Lung nodule images were classified into five groups according to their degree of subtlety.

 

Related publication: Shiraishi J, Katsuragawa S, Ikezoe J, et al: Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. AJR 174:71-74, 2000.

Anesthesiology

Behavioral and autonomic dynamics during propofol-induced unconsciousness dataset

Data were collected from nine healthy volunteers during a study of propofol-induced unconsciousness. For all subjects, approximately 3 hours of data were recorded under a target-controlled infusion protocol. The data include continuous electrocardiogram (ECG); interventions included in the study for patient safety, such as administering phenylephrine (a vasopressor); heart rate variability (HRV); and electrodermal activity (EDA).

 

Related publication: Subramanian, S., Purdon, P., Barbieri, R., & Brown, E. (2021). Behavioral and autonomic dynamics during propofol-induced unconsciousness (version 1.0). PhysioNet. https://doi.org/10.13026/2rbc-1r03

Cardiology

PTB-XL: EKG dataset

The PTB-XL ECG dataset is a large dataset of 21,837 clinical 12-lead ECGs, each 10 seconds long, from 18,885 patients. The raw waveform data were annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. In total, 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements.

 

Related publication: Wagner, P., Strodthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze, F.I., Samek, W., Schaeffter, T. (2020). PTB-XL: A Large Publicly Available ECG Dataset. Scientific Data. https://doi.org/10.1038/s41597-020-0495-6
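Because each PTB-XL record can carry several ECG statements, a common first step is to tally how often each statement appears. The sketch below assumes the metadata is a CSV table in which a column named `scp_codes` stores the statements as a dictionary-like string. The column name, the three sample rows, and the values in them are illustrative assumptions, not taken from the description above.

```python
import ast
import csv
import io
from collections import Counter

# Hypothetical three-row excerpt; ids, codes, and likelihoods are made up.
SAMPLE_CSV = """ecg_id,patient_id,scp_codes
1,101,"{'NORM': 100.0, 'SR': 0.0}"
2,102,"{'NORM': 80.0, 'SBRAD': 0.0}"
3,103,"{'IMI': 35.0, 'ABQRS': 0.0, 'SR': 0.0}"
"""


def count_statements(csv_text):
    """Count how many records carry each ECG statement code."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed format: a Python-literal dict mapping code -> likelihood.
        statements = ast.literal_eval(row["scp_codes"])
        counts.update(statements.keys())
    return counts


counts = count_statements(SAMPLE_CSV)
for code, n in counts.most_common():
    print(code, n)
```

`ast.literal_eval` is used instead of `eval` so that only plain literals are parsed from the file.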

Dermatology

DDI – Diverse Dermatology Images: Stanford AIMI Dataset

The Diverse Dermatology Images (DDI) dataset is the first publicly available, deeply curated, and pathologically confirmed image dataset with diverse skin tones. The DDI was retrospectively selected by reviewing pathology reports in Stanford Clinics from 2010 to 2020. It has a total of 656 images representing 570 unique patients.

General

Hugging Face Datasets

Datasets is a library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. It currently offers over 2,658 datasets and more than 34 metrics. At least 13 datasets match a search for the term “medical”. A dataset can be loaded in a single line of code, and the library’s data processing methods help get it ready for training a deep learning model.

Pulmonary

DCSM Sleep Staging Dataset

The DCSM dataset consists of 255 randomly selected and fully anonymized overnight lab-based PSG recordings from patients visiting the DCSM for the diagnosis of non-specific sleep-related disorders. The DCSM dataset represents a diverse cohort of Danish patients with respect to demographic characteristics, diagnostic background, and sleep/non-sleep-related medication usage, collected between 2015 and 2018.

 

Pulmonary

Dreem Open Datasets

Two publicly available datasets: DOD-H, which includes 25 healthy volunteers, and DOD-O, which includes 55 patients suffering from obstructive sleep apnea (OSA). Both datasets have been scored by 5 sleep technologists from different sleep centers. The authors developed a framework to compare automated approaches to a consensus of multiple human scorers.

 

Related publication: A. Guillot, F. Sauvet, E. H. During and V. Thorey, “Dreem Open Datasets: Multi-Scored Sleep Datasets to Compare Human and Automated Sleep Staging,” in IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 9, pp. 1955-1965, Sept. 2020, doi: 10.1109/TNSRE.2020.3011181.
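Comparing an automated approach to a consensus of several human scorers can be sketched as follows. This is a minimal illustration using a plain per-epoch majority vote, which is not necessarily the consensus rule used by the Dreem authors. The stage labels, the five scorer sequences, and the function names are hypothetical.

```python
from collections import Counter


def consensus_hypnogram(scorings):
    """Per-epoch majority vote across scorers.

    Ties are broken in favour of the label given by the earliest scorer.
    """
    consensus = []
    for epoch_labels in zip(*scorings):
        counts = Counter(epoch_labels)
        top = max(counts.values())
        consensus.append(next(l for l in epoch_labels if counts[l] == top))
    return consensus


def agreement(predicted, reference):
    """Fraction of epochs where the prediction matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical scorings: five scorers, five epochs each.
scorers = [
    ["W", "N1", "N2", "N2", "REM"],
    ["W", "N1", "N2", "N3", "REM"],
    ["W", "N2", "N2", "N3", "REM"],
    ["W", "N1", "N2", "N3", "N1"],
    ["W", "N1", "N1", "N2", "REM"],
]
automated = ["W", "N1", "N2", "N2", "REM"]

consensus = consensus_hypnogram(scorers)
score = agreement(automated, consensus)
print(consensus, score)
```

The same `agreement` function can score each human scorer against a consensus built from the other four, which gives a human baseline for the automated result.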

Cardiology/Imaging

Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (M&Ms) Dataset

375 heterogeneous cardiac magnetic resonance (CMR) datasets acquired by using four different scanner vendors in six hospitals and three different countries (Spain, Canada and Germany).

 

Related publication: V. M. Campello et al., “Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge,” in IEEE Transactions on Medical Imaging, vol. 40, no. 12, pp. 3543-3554, Dec. 2021, doi: 10.1109/TMI.2021.3090082.

Cancer/Genetics/Imaging

The Cancer Imaging Archive (TCIA) dataset collection

TCIA data are organized as “collections”; typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.), or research focus. Supporting data related to the images, such as patient outcomes, treatment details, genomics, and image analyses, are also provided when available. More than 100 datasets are available, many of which are public.

General

n2c2 NLP Research Data Sets

Unstructured notes from the Research Patient Data Registry at Partners Healthcare, Boston, USA (originally developed during the i2b2 project). The clinical Natural Language Processing (NLP) data sets were originally created at a former NIH-funded National Center for Biomedical Computing (NCBC) known as i2b2: Informatics for Integrating Biology and the Bedside. Since 2018, they have been officially known as n2c2 (National NLP Clinical Challenges).

General

emrQA dataset

emrQA is a publicly available, large-scale EMR question answering (QA) corpus, built with a novel semi-automated generation framework that requires minimal expert involvement and repurposes existing annotations available for other clinical NLP tasks. emrQA has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. The dataset uses existing NLP task annotations from the i2b2 Challenge datasets.

Related publication: Pampari, A., Raghavan, P., Liang, J.J., & Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP.

Anesthesiology

VSCapture: An open source tool for data acquisition from anesthesia monitors

VSCapture is an open source tool developed in the C# programming language on the .NET/Mono platform, which allows it to run on Windows, Macintosh OS X, and Ubuntu Linux.

 

Related Publication: Data acquisition from S/5 GE Datex anesthesia monitor using VSCapture.

 

Related Dataset: The University of Queensland Vital Signs Dataset.

 

The University of Queensland Vital Signs Dataset contains a wide range of patient monitoring data and vital signs that were recorded during 32 surgical cases where patients underwent anaesthesia at the Royal Adelaide Hospital.

Cancer/Pathology

Prostate cANcer graDe Assessment (PANDA) Challenge dataset

12,625 whole-slide images (WSIs) of prostate biopsies were available for model development (the development set), 393 for performance evaluation during the competition phase (the tuning set), 545 as the internal validation set in the postcompetition phase and 1,071 for external validation from 6 different sites.

 

Related publication: Bulten, W., Kartasalo, K., Chen, PH.C. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med (2022). https://doi.org/10.1038/s41591-021-01620-2

Cardiology/General

Hero DMC Heart Institute (HDHI): Hospital admissions dataset

This dataset comes from the cardiology unit of a tertiary care medical college and hospital in India and covers 14,845 admissions corresponding to 12,238 patients.

 

Related publication: Bollepalli, S.C.; Sahani, A.K.; Armoundas, A.A., et al. An Optimized Machine Learning Model Accurately Predicts In-Hospital Outcomes at Admission to a Cardiac Unit. Diagnostics 2022, 12, 241. https://doi.org/10.3390/diagnostics12020241

 

Dermatology

International Skin Imaging Collaboration (ISIC) Dataset

The dataset includes over 69,000 dermatology images. The International Skin Imaging Collaboration (ISIC) is a global partnership that has organized the world’s largest repository of publicly available dermoscopic images and hosted the first public benchmarks for melanoma detection in dermoscopic images, titled “Skin Lesion Analysis Towards Melanoma Detection”, at the IEEE International Symposium on Biomedical Imaging (ISBI).

Imaging

CQ500 dataset

A dataset of 491 head CT scans with 193,317 slices, including anonymized DICOMs for all the scans and the corresponding reads by three radiologists with 8, 12, and 20 years of experience in cranial CT interpretation, respectively.

 

Related publication: Development and Validation of Deep Learning Algorithms for Detection of Critical Findings in Head CT scan.

Critical Care/Imaging

COVID-Net

A publicly available suite of tailored deep neural network models for tackling challenges ranging from screening to risk stratification to treatment planning for patients infected with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The associated datasets include:

  • Chest X-rays: 16,352 CXR images across 14,979 patients
  • Chest CT: 201,103 CT slices from 4,501 patients
  • Chest point-of-care ultrasound: 29,651 POCUS images
  • COVID-Net ICU: 1,925 records from 385 patients

The project has also expanded into the open source TB-Net initiative for tuberculosis screening, the Fibrosis-Net initiative for pulmonary fibrosis progression prediction, and the Cancer-Net initiative for cancer screening.

Emergency Department

MIMIC-IV-ED

MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2016. It covers 448,972 ED stays, with vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses available.

Imaging

RICORD: RSNA International COVID-19 Open Annotated Radiology Database

This database is the first multi-institutional, multi-national, expert-annotated COVID-19 imaging dataset. Annotated by three radiologists with majority vote adjudication by board-certified radiologists, RICORD consists of 240 thoracic CT scans and 1,000 chest radiographs contributed from four international sites.

Anesthesiology

VitalDB dataset

A comprehensive dataset of 6,388 surgical patients composed of intraoperative biosignals and clinical information from the Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul, Korea.

Pathology

NuCLS

The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from The Cancer Genome Atlas (TCGA). These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students.

Imaging

CheXpert

CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients from Stanford Hospital.

Cancer/Genetics

Genomic Data Commons (GDC) datasets

The GDC Portal is a platform from the National Cancer Institute (NCI) with cancer-related genomic data for 80,000+ cases.

Imaging

BIMCV-COVID19 Imaging Datasets

The BIMCV-COVID19+ dataset is a large dataset with chest X-ray images and computed tomography (CT) imaging of COVID-19 patients, along with their radiographic findings, pathologies, polymerase chain reaction (PCR) results, immunoglobulin antibody tests, and radiographic reports from the Medical Imaging Databank of the Valencia Region (BIMCV). These iterations of the database include 7,377 CR, 9,463 DX, and 6,687 CT studies.

Imaging

VinBigData Chest X-ray abnormalities detection

Provided on Kaggle by the Vingroup Big Data Institute (VinBigData), which aims to promote fundamental research and investigate novel and highly applicable technologies. The dataset consists of 18,000 images that have been annotated by experienced radiologists.

Cardiology

EchoNet-Dynamic

The EchoNet-Dynamic database includes 10,030 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes.

Genetics/Pharmacology

PGxCorpus: a Manually Annotated Corpus for Pharmacogenomics

941 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs, and phenotypes) and the relationships between them.

General

CENTAUR LABS

A list of open source healthcare datasets classified into 40+ specialties, with direct links to the datasets and more information.

General

DATA WORLD – HEALTHCARE

More than 100 healthcare-related datasets from around the world, classified and annotated.

General

Determinants of COVID-19 mortality in the United States dataset (BrainX)

A dataset created to support continuing research into COVID-19. Because it contains information from all 50 states and the District of Columbia, many US statistics can also be compared.

Pharmacology

Drug-Induced Liver Injury (DILI) Dataset

The DILIrank dataset is an updated version of the LTKB Benchmark dataset. DILIrank consists of 1,036 FDA-approved drugs that are divided into four classes according to their potential for causing drug-induced liver injury (DILI).

Ophthalmology

SUSTech-SYSU dataset

Dataset for automatically segmenting and classifying corneal ulcers with 712 ocular staining images and the associated segmentation labels for flaky corneal ulcers.

General

Harvard Dataverse

4,000+ healthcare datasets made available by Harvard University. Searchable and diverse.

Pathology

PanNuke Dataset

A semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources.

Imaging

ACR COVID-19 Imaging Dataset

A dataset of images, mainly chest X-rays, from COVID-19 patients.

General

C3.ai COVID-19 Data Lake

Multiple COVID-19 data sources in a unified data model, ready for analysis in one place.

General

COVID-19 Open Research Dataset Challenge (CORD-19)

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

General

Novel Corona Virus 2019 Dataset

This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus.

The data are available from 22 January 2020 onward.

Imaging

The RSNA 2019 Brain CT Hemorrhage Dataset

The largest collection of intracranial hemorrhage CT scans, comprising 874,035 images with expert annotations.

 

Reference: Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge

General

PhysioNet (MIMIC/eICU Collaborative)

One of the most comprehensive sources of healthcare datasets, primarily from ICU patients.

https://physionet.org/about/database/

MIMIC-IV Dataset (https://physionet.org/content/mimiciv/0.4/)

Includes:

  • Clinical datasets such as MIMIC, eICU Collaborative, and pediatric ICU datasets.
  • Waveform datasets with ECG, EEG, and arterial blood pressure waveforms.
  • ECG datasets with various pathophysiologic changes and drug interactions.
  • Fetal datasets including sounds and ECG.
  • Gait and balance datasets, including gait dynamics for patients with various neurodegenerative disorders.
  • Neuro and myoelectric datasets with EEG, EMG, and evoked potential waveforms.
  • Image datasets with chest X-rays and MRI images.
  • Computed tomography images for intracranial hemorrhage detection and segmentation.
  • Miscellaneous datasets with text, language, posture, and other data.

Imaging/Neurology

ADNI Database

Imaging (MRI), clinical, genomic, and biomarker data from patients with Alzheimer’s disease, for the purposes of scientific investigation, teaching, or planning clinical research studies.

http://adni.loni.usc.edu/data-samples/access-data/

Ophthalmology

RIM-ONE

RIM-ONE is a database for evaluating optic disc and cup segmentation, provided by the Medical Image Analysis group.

Critical Care

AmsterdamUMCdb

Contains data related to 23,376 intensive care unit and high dependency unit admissions at Amsterdam University Medical Center of adult patients from 2003-2016.

 

Pharmacology

FDA Adverse Event Reporting System (FAERS)

The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to the FDA.

Microbiology

Malaria Dataset

A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity. The dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells.

Ophthalmology

RIGA Dataset: Retinal fundus images for glaucoma analysis

A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) derived from three sources, with 750 original images and 4,500 manually marked images.

 

Ophthalmology

High-Resolution Fundus (HRF) Image Database

The public database contains 15 images of healthy patients, 15 images of patients with diabetic retinopathy and 15 images of glaucomatous patients.

Ophthalmology

DR HAGIS: Diabetic Retinopathy, Hypertension, Age-related macular degeneration and Glaucoma ImageS

39 images for development of vessel extraction algorithms suitable for retinal screening programmes.

Cancer

NLST Datasets: National Cancer Institute

Datasets from the National Cancer Institute covering over 54,000 patients. They include data on participant characteristics, screening exam results, diagnostic procedures, lung cancer, and mortality. Images from over 75,000 CT screening exams are available. Over 1,200 pathology images from a subset of NLST lung cancer patients (~500 of over 2,000 patients) may be viewed.

Pulmonary

NSRR Datasets: National Sleep Research Resource

Polysomnography datasets from the NSRR for sleep studies. A large collection of de-identified physiologic signals well suited to ML development.

Dermatology

The HAM10000 dataset

A large collection of multi-source dermatoscopic images of common pigmented skin lesions containing 10,000 images.

Related publication: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

General

UCI Machine Learning Repository

This open source repository has more than 400 datasets, including healthcare (100+) and non-healthcare ones, in a searchable and categorized format.

General

Centers for Medicare and Medicaid Services (CMS) datasets with ResDAC link

CMS provides US Medicare and Medicaid datasets.

ResDAC (the Research Data Assistance Center) provides free support to users of CMS datasets. Link: https://www.resdac.org/learn

General

Centers for Disease Control and Prevention (CDC) Datasets

Datasets from the Centers for Disease Control and Prevention. Useful for the incidence and prevalence of various disorders and for mortality data from across the US.

General

Healthcare Cost and Utilization Project (HCUP) datasets

The Agency for Healthcare Research and Quality’s HCUP datasets are used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.

General

NHS datasets

Datasets from the UK government’s National Health Service. The NHS Choices datasets are useful for NLP and sentiment analysis for both GPs and hospitals.

Imaging

OASIS Brain MRI dataset

Brain MRI datasets from the Open Access Series of Imaging Studies (OASIS).

Neurology

OpenNEURO

A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data with over 200 datasets.

Cancer

National Cancer Institute (NCI)-SEER datasets

Cancer epidemiology data available through NCI’s Surveillance, Epidemiology, and End Results (SEER) Program.

Cancer/Genetics

Broad Institute’s Cancer Program datasets

Cancer and genomics datasets.

Imaging

MURA

A dataset of 14,000+ anonymized, radiologist-labeled musculoskeletal X-rays from 12,000+ patients, released by the Stanford ML Group.

 

Related publication: https://arxiv.org/abs/1712.06957

Imaging

fastMRI

An anonymized dataset of 1,500+ knee MRIs from NYU.

General

NLTK: Natural Language Toolkit

A one-stop resource for learning natural language processing and more.

Related publication: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

General

DAIR.AI

An excellent resource for trends and updates in AI, especially NLP, by Elvis Saravia.

General

Data science article collection

An excellent collection of articles on data science.

General

Google Dataset Search

Google’s powerful search engine to assist with dataset search.

Imaging

NIH CXR14 dataset

Over 100,000 anonymized chest x-ray images and their corresponding data from more than 30,000 patients, including many with advanced lung disease.

Imaging

NIH Deep Lesion

An NIH release of a dataset containing 32,000 CT scan images with annotated lesions, belonging to 4,400 unique patients.

General

Blue Button 2.0

A CMS initiative to democratize research and development using beneficiary data. A dataset covering more than 70 million patients is available.

General

National Institutes of Health

The link below is for NIH’s strategic plan for data science in healthcare. A must-read for anyone using data in healthcare for research and innovation.

Imaging

NIH Clinical Center

The largest open source chest X-ray dataset available through NIH’s Clinical Center. See the link in the article to access the data. Also available through GitHub and Kaggle.

General

GitHub

One of the largest and most advanced software development platforms in the world, with many datasets and repositories.

General

Kaggle

Kaggle is a great resource for de-identified datasets in healthcare.

General

DataMed

A biomedical data search engine which searches for datasets across registries.

General

Mendeley

A place to store, share, or find data. A platform for biomedical research.

General

Nature

Detailed data repositories for biomedical research, especially proteins and genetics.