GPTNERMED: NER dataset for German Medical text
GPTNERMED is a novel open synthesized dataset and neural named-entity-recognition (NER) model for German texts in medical natural language processing (NLP). This dataset contains the synthetic German sentences from the GPTNERMED project, annotated with three entity types: Medikation (medication), Dosis (dose), and Diagnose (diagnosis). Neither the sentences nor the annotations have been manually validated by medical professionals, so this is not a gold-standard dataset. The dataset consists of 9,845 sentences (121,027 tokens by the SpaCy tokenizer, 245,107 tokens by the GPT tokenizer).
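As a rough illustration of what a sentence annotated with the three entity labels could look like, the sketch below represents annotations as character-offset spans. The example sentence, the `(start, end, label)` tuple layout, and the helper names `make_span` and `count_labels` are assumptions made for illustration only; they are not taken from the dataset files, whose actual format may differ.

```python
from collections import Counter

# The three entity labels named in the dataset description.
LABELS = {"Medikation", "Dosis", "Diagnose"}

# Hypothetical example sentence (NOT taken from the dataset).
text = "Der Patient erhielt 400 mg Ibuprofen wegen Migräne."


def make_span(text, surface, label):
    """Build a (start, end, label) span for the first occurrence of `surface`."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    start = text.index(surface)  # raises ValueError if the surface is absent
    return (start, start + len(surface), label)


def count_labels(spans):
    """Count how many spans carry each label."""
    return Counter(label for _, _, label in spans)


# Assumed annotation layout: a list of character-offset spans per sentence.
spans = [
    make_span(text, "400 mg", "Dosis"),
    make_span(text, "Ibuprofen", "Medikation"),
    make_span(text, "Migräne", "Diagnose"),
]

for start, end, label in spans:
    print(f"{label}: {text[start:end]!r} [{start}:{end}]")
```

Because the annotations are not manually validated, a check of this kind (do the offsets slice out the expected surface text, and are all labels among the three known ones) may be a sensible first step before training on the data.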
Related publication: Frei J, Kramer F. Annotated dataset creation through large language models for non-english medical NLP. J Biomed Inform. 2023 Sep;145:104478.