Data Analysis

The Data Analysis group works on two main fields.

First, we gather, prepare and analyse Linked Data. The first step of this pipeline is handled by our open-source crawler Squirrel, which has been used in several projects, including the two research projects OPAL and LIMBO.

After data has been gathered, we provide Fact Checking services that can be used to ensure the veracity of the data with respect to a reference knowledge base or a reference corpus.

The second main field of this group is benchmarking. We maintain several benchmarking platforms:

  • HOBBIT is a holistic benchmarking platform for Big Linked Data solutions.
  • GERBIL is a light-weight platform for benchmarking web services. It currently supports benchmarking in three areas:
    • Knowledge Extraction
    • Question Answering
    • Knowledge Base Curation
  • IGUANA is a benchmarking platform for evaluating the performance of triple stores.
Additionally, we are interested in benchmarking in general and provide several benchmarks.

Data Integration

The Linked Data paradigm builds upon the backbone of distributed knowledge bases connected by typed links. The sheer volume of current knowledge bases, as well as their number, poses two major challenges when aiming to support the integration of data across and within them. First, tools for data integration have to be time-efficient, especially when dealing with big datasets. Second, these tools have to carry out data integration tasks with high quality to serve the applications built upon Linked Data well. Our solutions to the second problem build upon efficient computational approaches developed to solve the first and combine these with dedicated machine learning techniques. All our frameworks for data integration, such as LIMES and DEER, are open-source and available under a GNU license at https: // together with user/developer manuals.
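To make the trade-off between time-efficiency and link quality concrete, the following is a minimal sketch of link discovery between two tiny knowledge bases. It is not how LIMES or DEER work internally; the data, the example.org IRIs, the function names, the blocking key and the threshold are all assumptions chosen for illustration. Blocking reduces the number of comparisons (efficiency), while the similarity threshold controls which typed links are accepted (quality).

```python
from collections import defaultdict
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical toy data: (IRI, label) pairs from two knowledge bases.
SOURCE = [
    ("http://example.org/a/Paderborn", "Paderborn"),
    ("http://example.org/a/Leipzig", "Leipzig"),
]
TARGET = [
    ("http://example.org/b/city1", "paderborn"),
    ("http://example.org/b/city2", "Leipzig (city)"),
    ("http://example.org/b/city3", "Berlin"),
]


def block_key(label):
    """Cheap blocking key (an assumption): lowercased first character."""
    return label.strip().lower()[:1]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, threshold=0.6):
    """Return (source IRI, link type, target IRI, score) tuples.

    Only resources that share a blocking key are compared, which avoids
    the full |source| x |target| comparison.
    """
    blocks = defaultdict(list)
    for iri, label in target:
        blocks[block_key(label)].append((iri, label))

    links = []
    for s_iri, s_label in source:
        for t_iri, t_label in blocks.get(block_key(s_label), []):
            score = similarity(s_label, t_label)
            if score >= threshold:
                links.append((s_iri, OWL_SAME_AS, t_iri, round(score, 2)))
    return links


if __name__ == "__main__":
    for link in discover_links(SOURCE, TARGET):
        print(link)
```

Raising the threshold trades recall for precision; a first-character blocking key is deliberately naive and would miss matches whose labels start differently.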

Data Portals

Open Data creates social and commercial opportunities. For instance, citizens and companies can combine mobility data, such as the routes of trains and buses, with other data to plan excursions for special interests. Metadata records and the data itself are often distributed across multiple systems and available in different formats. The Data Portals Group works, inter alia, on the following topics:

  • Data on multiple systems and in different formats has to be crawled and stored. For this purpose, 5-Star Linked Data and appropriate ontologies are used.
  • The stored data is then analyzed and refined. For huge amounts of data, the related quality ratings and data enrichments are preferably carried out automatically. Enrichments include, for instance, the recognition of named entities such as locations and the subsequent linking of the data to related geo data, such as coordinates or regions.
  • Additional knowledge bases can be integrated to assist in further tasks, e.g. finding appropriate licenses for the distribution of combined datasets.
  • Knowledge graphs are also used to overcome natural-language barriers, for example by matching synonyms, and thereby improve the findability of datasets. Findability is also supported by specialized interfaces such as Question Answering systems.

Data Storage and Querying

The constant growth of Linked Data on the Web gives rise to new challenges for querying and integrating massive amounts of data. Such datasets are available through various interfaces, such as data dumps, Linked Data documents and webpages, SPARQL endpoints, Triple Pattern Fragments, or the Linked Data Platform. In addition, various sources produce streaming data. Efficiently querying these sources is of central importance for the scalability of Linked Data and Semantic Web technologies. To exploit the massive amount of data to its full potential, users should be able to store, query, and combine this data easily and effectively. The DSQ group develops scalable, high-performance RDF systems for storing and querying Big RDF data. In addition, we work on knowledge extraction from the Web and on modelling the extracted knowledge as RDF graphs. Finally, we aim to design representative benchmarks for storing, querying, and extracting RDF data.
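As a rough illustration of what querying RDF data by triple patterns means, the sketch below matches patterns against a small in-memory list of triples and joins two patterns on a shared variable. It is a toy under stated assumptions, not the design of any of the group's systems: the `ex:` names, the triples themselves and the function names are hypothetical, and real stores rely on indexes rather than a linear scan.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy graph; prefixes are written inline for brevity.
TRIPLES: List[Triple] = [
    ("ex:store1", "rdf:type", "ex:TripleStore"),
    ("ex:store2", "rdf:type", "ex:TripleStore"),
    ("ex:crawler1", "rdf:type", "ex:Crawler"),
    ("ex:store1", "ex:supports", "ex:SPARQL"),
    ("ex:crawler1", "ex:produces", "ex:dump1"),
]


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield all triples matching the pattern; None acts as a variable."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


def typed_subjects_with(triples: List[Triple], rdf_type: str,
                        predicate: str) -> List[Tuple[str, str]]:
    """Join two patterns on the subject:
    ?x rdf:type <rdf_type> . ?x <predicate> ?y
    """
    results = []
    for subj, _, _ in match(triples, p="rdf:type", o=rdf_type):
        for _, _, obj in match(triples, s=subj, p=predicate):
            results.append((subj, obj))
    return results


if __name__ == "__main__":
    print(list(match(TRIPLES, p="rdf:type", o="ex:TripleStore")))
    print(typed_subjects_with(TRIPLES, "ex:TripleStore", "ex:supports"))
```

The nested loop in `typed_subjects_with` rescans the whole list for every binding of the first pattern, which is exactly the kind of cost that indexing and query planning in a scalable RDF system are meant to avoid.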

Machine Learning

The machine learning group focuses on machine learning technologies for knowledge graphs. The research goal is to develop novel, scalable machine learning algorithms, e.g., for entity embeddings, clustering of high-dimensional data, and explainable machine learning. Research activities of the ML group are in the scope of structured machine learning, e.g., learning concepts in description logics, and entity embeddings, e.g., based on physical models and convolutional complex models. The group provides open-source tools for machine learning on knowledge graphs.

NLP and Data Access

Our group works at the intersection of Data Science and Natural Language Processing. We focus on creating algorithms that allow computers to automatically extract large-scale knowledge from unstructured data and process it while preserving its key semantic information. We aim to make the acquired knowledge accessible and understandable for both humans and computers. Our team started by addressing two of the most important tasks in NLP, Named Entity Recognition and Entity Linking, by relying on Knowledge Graphs. Our research resulted in two frameworks that are state of the art with respect to multilingualism and knowledge-graph-based algorithms. Recently, we expanded our focus to other NLP tasks, ranging from basic research in computational linguistics to Question Answering, Machine Translation, Natural Language Generation, and Natural Language Understanding.

Our group includes members of both Paderborn and Leipzig universities.