Extend Bengal towards a new language

Master Thesis

Extracting structured information that can be expressed in RDF from unstructured data such as natural language text is a major research area for both the Semantic Web and Natural Language Processing communities. Several approaches tackle this problem, and several benchmarks exist for evaluating their performance (see http://gerbil.aksw.org/).

The main disadvantage of these benchmarks is their size. Because an information extraction benchmark dataset has to be created manually, its creation is expensive and its size is limited. This makes it nearly impossible to benchmark the different approaches at the scale of Big Linked Data. To address this, the DICE group developed Bengal, an approach that automatically generates natural language documents from a given knowledge base (https://github.com/dice-group/Bengal). These documents can be used to benchmark Named Entity Recognition and Linking approaches with respect to their efficiency when handling large amounts of data.
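
To illustrate the general idea of generating annotated text from a knowledge base, here is a minimal Python sketch. It is not Bengal's actual implementation: the `TEMPLATES` table, the `Mention` class, the `verbalize` function and the `ex:` identifiers are all hypothetical names chosen for this example. It shows why generated documents come with gold annotations for free: the generator knows which entity it writes at which character offset.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass
class Mention:
    start: int  # character offset where the entity label begins
    end: int    # character offset just past the entity label
    uri: str    # knowledge base identifier of the entity


# Hypothetical, language-specific sentence templates, one per predicate.
# Supporting a new language would require (at least) a new set of these.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "capital": "{o} is the capital of {s}.",
}


def verbalize(triple, labels):
    """Turn one (subject, predicate, object) triple into a sentence
    plus the gold entity mentions contained in it."""
    s, p, o = triple
    slots = {"s": s, "o": o}
    text = ""
    mentions = []
    # Walk the template piece by piece so that offsets are exact.
    for literal, field, _spec, _conv in Formatter().parse(TEMPLATES[p]):
        text += literal
        if field is not None:
            uri = slots[field]
            start = len(text)
            text += labels[uri]
            mentions.append(Mention(start, len(text), uri))
    return text, mentions


# Toy data, made up for this example.
labels = {"ex:Ada": "Ada Lovelace", "ex:London": "London"}
text, mentions = verbalize(("ex:Ada", "birthPlace", "ex:London"), labels)
print(text)
for m in mentions:
    print(m.uri, repr(text[m.start:m.end]))
```

In this sketch the language-dependent part is confined to the templates and labels. A real generator also has to handle grammar, such as inflection and agreement, which is part of what makes supporting a new language non-trivial.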

The focus of this master thesis is to extend Bengal to a language other than English or Portuguese and to benchmark at least two approaches for this language. Underrepresented languages are very welcome.