Surgery
SurgLaVi Dataset
SurgLaVi is the largest and most diverse surgical vision–language dataset to date, comprising nearly 240k clip–caption pairs from more than 200 procedures, organized hierarchically at coarse, mid, and fine levels. At its core is a fully automated pipeline that generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. The resulting captions are then enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, the researchers release SurgLaVi-β, an open-source derivative of ~113k clip–caption pairs constructed entirely from public data, which is over four times larger than existing surgical vision–language pre-training (VLP) datasets.
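To make the hierarchical organization concrete, here is a minimal Python sketch of how a clip–caption pair tagged with one of the three levels might be represented. The record layout is an assumption for illustration only: the names `Level`, `ClipCaption`, and `count_by_level`, the field names, and the placeholder values are hypothetical and do not reflect the dataset's actual schema or release format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The three hierarchy levels named in the description."""
    COARSE = "coarse"
    MID = "mid"
    FINE = "fine"


@dataclass(frozen=True)
class ClipCaption:
    """Hypothetical record for one clip-caption pair (not the real schema)."""
    video_id: str
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds
    level: Level
    caption: str

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def count_by_level(pairs):
    """Count how many clip-caption pairs fall at each hierarchy level."""
    return Counter(p.level for p in pairs)


# Placeholder records: one coarse clip spanning nested mid and fine clips.
pairs = [
    ClipCaption("video_001", 0.0, 600.0, Level.COARSE, "placeholder coarse caption"),
    ClipCaption("video_001", 0.0, 120.0, Level.MID, "placeholder mid caption"),
    ClipCaption("video_001", 0.0, 15.0, Level.FINE, "placeholder fine caption"),
    ClipCaption("video_001", 15.0, 40.0, Level.FINE, "placeholder fine caption"),
]

counts = count_by_level(pairs)
print({level.value: n for level, n in counts.items()})
```

In this sketch the finer clips nest inside the time span of the coarser ones, which is one plausible reading of "hierarchical levels"; the source does not specify how the levels relate in time.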