General
HC4 (Healthcare Comprehensive Commons Corpus)
HC4 is a large-scale pretraining corpus of healthcare-related text, totaling roughly 153 GB (around 65 billion tokens) across more than 9.7 million documents. It was curated to enable systematic investigation of how data composition influences language model behavior, including potential demographic biases. Sources include peer-reviewed scientific literature drawn from the PubMed Central, Semantic Scholar, and OpenAlex repositories.
Related publication: Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency (EMNLP 2025). https://arxiv.org/pdf/2510.18556
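Given the corpus size, streaming access is the practical way to iterate over documents without downloading all ~153 GB up front. The sketch below assumes the corpus is distributed in a Hugging Face datasets-compatible format; the dataset identifier "HC4/healthcare-commons" and the field names "text" and "source" are placeholders, not confirmed by the release.

```python
# Minimal sketch of streaming HC4 for inspection or pretraining, assuming a
# Hugging Face `datasets`-compatible release. The hub path and record schema
# below are hypothetical placeholders.
from datasets import load_dataset

# streaming=True yields documents lazily instead of materializing the corpus.
corpus = load_dataset("HC4/healthcare-commons", split="train", streaming=True)

for doc in corpus.take(3):
    # "source" and "text" are assumed field names for provenance and content.
    print(doc.get("source"), len(doc.get("text", "")))
```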