DICE Colloquium, 21.06.2019: “TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications”
At the DICE Colloquium on Friday, 21 June 2019, Daniel Vollmers presented the paper “TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications”, submitted to ISWC 2018 by Sepideh Mesbah and her colleagues.
The approach addresses the problem of generating training data for supervised NER approaches, so that models can be learned that extract long-tail entities such as datasets or research methods from scientific publications. The abstract of the paper reads:
“Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”,“StatSnowball”) are rare and often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models, trained on expensive type labeled data laboriously produced by human annotators. A common workaround is the generation of labeled training data from knowledge bases; this approach is not suitable for long-tail entity types that are, by definition, scarcely represented in KBs. This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entities types Datasets, Methods in computer science publications, and Proteins in biomedical publications.”
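To make the iterative idea from the abstract more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the classifier is replaced by a simple two-word left-context matcher, the semantic expansion step is left out, and the filter is a plain capitalisation check. All function names and the example sentences are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical example corpus; the sentences are made up for this sketch.
SENTENCES = [
    "We evaluate on the WebKB corpus.",
    "Experiments were run on the Reuters corpus.",
    "We trained using the Reuters collection.",
    "Results are reported using the DBLP collection.",
    "The model is based on the idea of bootstrapping.",
]


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9-]+", sentence)


def extract_training_sentences(sentences, known):
    """Training data extraction: keep sentences mentioning a known entity."""
    return [s for s in sentences if any(t in known for t in tokenize(s))]


def learn_contexts(training_sentences, known):
    """Stand-in for training a NER model (assumption of this sketch):
    remember the two lowercased words preceding each known entity."""
    contexts = set()
    for s in training_sentences:
        tokens = tokenize(s)
        for i in range(2, len(tokens)):
            if tokens[i] in known:
                contexts.add((tokens[i - 2].lower(), tokens[i - 1].lower()))
    return contexts


def extract_candidates(sentences, contexts):
    """Apply the 'model': any token following a learned context is a candidate."""
    candidates = set()
    for s in sentences:
        tokens = tokenize(s)
        for i in range(2, len(tokens)):
            if (tokens[i - 2].lower(), tokens[i - 1].lower()) in contexts:
                candidates.add(tokens[i])
    return candidates


def filter_candidates(candidates):
    """Toy result filter: keep only capitalised tokens."""
    return {c for c in candidates if c[0].isupper()}


def bootstrap(sentences, seeds, iterations=2):
    """Grow the entity set from a small seed set over several iterations."""
    known = set(seeds)
    for _ in range(iterations):
        training = extract_training_sentences(sentences, known)
        contexts = learn_contexts(training, known)
        known |= filter_candidates(extract_candidates(sentences, contexts))
    return known


if __name__ == "__main__":
    print(sorted(bootstrap(SENTENCES, {"WebKB"})))
```

Starting from the single seed “WebKB”, the first iteration picks up “Reuters” through the shared context “on the”, and the second iteration then reaches “DBLP” through the new context “using the”. The candidate “idea” also matches a learned context but is dropped by the filter, which illustrates why a filtering step is needed to keep noise out of the next iteration.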