OpenAI: healthbench-professional Eval dataset
HealthBench Professional contains 525 physician-authored tasks spanning three clinician-facing use cases: care consult, writing and documentation, and medical research. Each example evaluates the next model response in a single-turn or multi-turn conversation between a clinician and a model, and is graded against example-specific criteria, similar to HealthBench. The dataset was built from physician-authored annotations that went through extensive vetting and quality control. A total of 190 physicians contributed, with practice experience across 50 countries and 26 medical specialties.
Related publication: HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats. Rebecca Soskin Hicks, et al.