Available Thesis Topics

BQL (Google graph store) to SPARQL bridge (Master)

Supervisor: Prof. Axel Ngonga
Contact: Prof. Axel Ngonga

Google's BadWolf (https://github.com/google/badwolf) implements BQL, a query language for a graph model loosely modelled around RDF. But how loosely? The core of this thesis is to create a converter from BQL to SPARQL and vice versa, which will allow benchmarking the performance of BadWolf against that of existing triple stores. The main tasks are hence to study the calculus behind BQL, map it to the SPARQL calculus, implement the mapping, and run benchmarks derived from real query logs on BadWolf and three selected triple stores (Virtuoso, GraphDB and TENTRIS).
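To get a first feel for the kind of mapping involved, consider a toy rewriter for a single triple pattern. The BQL-style syntax below is simplified and only assumed for illustration; the actual grammar has to be taken from the BadWolf documentation, and the example prefix is made up.

```python
import re

# Toy rewriter for a single, simplified BQL-style triple pattern of the form
#   ?s "predicate"@[] ?o
# into a SPARQL triple pattern. This is NOT the full BQL grammar; it only
# illustrates the kind of structural mapping the thesis has to formalise.
BQL_PATTERN = re.compile(r'(\?\w+)\s+"([^"]+)"@\[\]\s+(\?\w+|"[^"]+")')

def bql_pattern_to_sparql(bql: str, prefix: str = "http://example.org/") -> str:
    match = BQL_PATTERN.fullmatch(bql.strip())
    if match is None:
        raise ValueError(f"unsupported pattern: {bql}")
    subject, predicate, obj = match.groups()
    return f"{subject} <{prefix}{predicate}> {obj} ."

if __name__ == "__main__":
    # 'knows' and the example.org prefix are placeholders for this sketch.
    print(bql_pattern_to_sparql('?s "knows"@[] ?o'))
    # -> ?s <http://example.org/knows> ?o .
```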

Benchmarking BadWolf using IGUANA 2.0 (Bachelor)

Supervisor: Felix Conrads
Contact: Prof. Axel Ngonga

Triple stores answer SPARQL queries and play a central role in the management of RDF data, as they are the backbone of a plethora of Linked-Data-driven applications. However, it is often unclear how triple stores perform for particular use cases. Reference benchmarking frameworks such as IGUANA make it possible to benchmark triple stores and obtain comparable results. With BadWolf, Google developed a store able to deal with queries in BQL, a language similar to SPARQL. The goal of this thesis is to develop a benchmark which allows evaluating the performance of BadWolf on loads which can be expressed equivalently in both SPARQL and BQL. The student is expected to (1) develop the benchmark, (2) execute the benchmark using IGUANA, and (3) report their findings as to the pros and cons of BadWolf.


Data: http://iguana-benchmark.eu/, https://github.com/google/badwolf  
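IGUANA takes care of the measurement in a configurable and reproducible way. Purely to illustrate what is being measured, a hand-rolled query-mix loop could look like the following sketch; the endpoint URL, the query and the use of the SPARQL protocol's `query` parameter are placeholders, and BadWolf's HTTP interface may differ.

```python
import time
import urllib.parse
import urllib.request

def run_query_mix(endpoint: str, queries: list[str]) -> float:
    """Send each query to an HTTP endpoint and return queries per second."""
    start = time.perf_counter()
    for q in queries:
        params = urllib.parse.urlencode({"query": q})
        with urllib.request.urlopen(f"{endpoint}?{params}") as response:
            response.read()
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed

# Placeholder endpoint and query; real runs would use IGUANA with query logs.
# qps = run_query_mix("http://localhost:8890/sparql",
#                     ["SELECT * WHERE { ?s ?p ?o } LIMIT 10"])
```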

Usage of X2vec for dataset search (Master)

Supervisor: Michael Röder
Contact: Michael Röder

For this thesis, the student should gain an overview of the different X2vec methods that exist to transform a given dataset into a vector representation. This representation should be used for calculating similarity values to other, indexed datasets, as done with Tapioca (http://aksw.org/projects/Tapioca.html).
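However the vectors are produced, the retrieval step boils down to ranking indexed datasets by vector similarity. A minimal sketch with randomly generated stand-in vectors (the dataset names and the dimensionality are made up):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dataset embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_datasets(query_vec: np.ndarray, index: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank indexed datasets by similarity to the query dataset's vector."""
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in index.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The vectors would come from whichever X2vec method the thesis selects;
# random vectors are used here only to make the example runnable.
rng = np.random.default_rng(0)
index = {f"dataset_{i}": rng.normal(size=64) for i in range(5)}
print(rank_datasets(rng.normal(size=64), index)[:3])
```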

Development of a benchmark for dataset linkage recommendation systems (Master)

The thesis will develop a benchmark for dataset similarity approaches like Tapioca. The goal of the benchmark is to measure how well a dataset linkage recommendation system orders the available datasets with respect to a given dataset (comparable to a "normal" information retrieval search engine).

To be able to measure the quality of the ranking, the student will use a classification task. The task is mainly based on the RDF dataset which is used as the query for the dataset linkage recommendation system. The assumption is that a good recommendation system should provide datasets that increase the classifier's performance on its classification task.
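Which ranking metric the benchmark reports is open; nDCG is one standard candidate, sketched below. Treating the classifier's accuracy gain as the relevance score of a recommended dataset is an assumption of this sketch, not a fixed design decision.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances: list[float]) -> float:
    """Normalised DCG: 1.0 means the system returned the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Relevance here could be the accuracy gain the classifier achieves when the
# recommended dataset is added to the training data.
print(ndcg([0.8, 0.1, 0.4, 0.0]))
```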

As a stretch goal, the benchmark should be executable on the HOBBIT platform.

Supervisor: Michael Röder

Optimization techniques for federated SPARQL query processing

Supervisor: Dr. Muhammad Saleem
Contact: Dr. Muhammad Saleem

This thesis will explore the different optimization techniques used in distributed SPARQL query processing, in particular source selection, indexing, join ordering and query planning, and the different join implementations.
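As one concrete example of the techniques in scope: a very simple source-selection strategy probes every federation member with an ASK query per triple pattern and only ships the pattern to endpoints that answer positively. A sketch using SPARQLWrapper; the endpoint list and the triple pattern are hypothetical.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def relevant_sources(triple_pattern: str, endpoints: list[str]) -> list[str]:
    """Return the endpoints that can contribute bindings for a triple pattern,
    determined by sending an ASK query to each of them."""
    selected = []
    for url in endpoints:
        sparql = SPARQLWrapper(url)
        sparql.setQuery(f"ASK WHERE {{ {triple_pattern} }}")
        sparql.setReturnFormat(JSON)
        if sparql.query().convert().get("boolean"):
            selected.append(url)
    return selected

# Hypothetical federation members:
# relevant_sources("?drug <http://www.w3.org/2002/07/owl#sameAs> ?x",
#                  ["https://dbpedia.org/sparql", "http://localhost:3030/ds/sparql"])
```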

Optimization techniques in Triple stores for SPARQL query processing

Supervisor: Dr. Muhammad Saleem
Contact: Dr. Muhammad Saleem

This thesis will explore the different optimization techniques used in state-of-the-art triple stores, including data representation and storage, indexing, join ordering and query planning, and the different join implementations.
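To make the indexing aspect concrete, the following toy in-memory store keeps three index permutations so that any triple pattern with one or two constants can be answered by dictionary lookups; real triple stores additionally dictionary-encode terms, keep more permutations and compress the indexes.

```python
from collections import defaultdict

class TinyTripleStore:
    """Toy store with the SPO, POS and OSP index permutations."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """Answer the pattern (s, p, ?o) via the SPO index."""
        return self.spo[s][p]

    def subjects(self, p, o):
        """Answer the pattern (?s, p, o) via the POS index."""
        return self.pos[p][o]

store = TinyTripleStore()
store.add(":alice", ":knows", ":bob")
print(store.subjects(":knows", ":bob"))  # {':alice'}
```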

Analysis of the relative errors in cardinality-based SPARQL federation engines

Supervisor: Dr. Muhammad Saleem
Contact: Dr. Muhammad Saleem

This thesis will investigate how good the query plan generated by the underlying cost-based distributed SPARQL engine is in terms of the relative error. The relative error is a performance measure that tells how accurate the estimated result size of a triple pattern, or of a join between triple patterns, is. More accurate estimations lead to better query execution times.
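One common definition of the relative estimation error is sketched below; other works use the symmetric q-error instead, and fixing the exact definition is part of the thesis.

```python
def relative_error(estimated: float, actual: float) -> float:
    """Relative cardinality estimation error: |estimated - actual| / actual.
    Alternative measures such as the q-error max(est/act, act/est) exist."""
    if actual == 0:
        return float("inf") if estimated > 0 else 0.0
    return abs(estimated - actual) / actual

# A join whose cardinality was estimated at 1,000 but which returns 4,000 rows:
print(relative_error(1_000, 4_000))  # 0.75
```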

Extend Bengal towards a new language (Master)

Supervisor: Diego Moussallem
Contact: Michael Röder, Diego Moussallem

Extracting structured information which can be described in RDF from unstructured data like natural language text is one of the major fields of the Semantic Web and Natural Language Processing communities. There are several approaches to tackle this problem and several benchmarks for evaluating the performance of these approaches (see http://gerbil.aksw.org/).

The main disadvantage of these benchmarks is their size. Since an information extraction benchmark dataset has to be created manually, its creation is expensive and its size is limited. This makes it nearly impossible to benchmark the different approaches in the area of Big Linked Data.
To this end, the DICE group developed an approach to automatically generate natural language documents based on a given knowledge base (https://github.com/dice-group/Bengal). These documents can be used for benchmarking Named Entity Recognition and Linking approaches regarding their efficiency when it comes to handling large amounts of data.

The focus of this master thesis is to extend Bengal to a language other than English or Portuguese and to benchmark at least two approaches for this language. Underrepresented languages are very welcome.
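Conceptually, the language-specific part of such a generator lies in the verbalisation templates and grammar rules. The sketch below is not Bengal's actual code (see the repository for that); it only illustrates, with made-up templates, where the work for a new target language sits.

```python
# Per-language templates for verbalising a "born in" triple; a real generator
# like Bengal handles morphology, aggregation and many more predicates.
TEMPLATES = {
    "en": "{subject} was born in {object}.",
    "pt": "{subject} nasceu em {object}.",
    # A new target language mainly needs resources like the following line.
    "de": "{subject} wurde in {object} geboren.",
}

def verbalise(subject: str, obj: str, language: str) -> str:
    return TEMPLATES[language].format(subject=subject, object=obj)

print(verbalise("Ada Lovelace", "London", "de"))
```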

Veracity of Knowledge Bases (Master)

In the Semantic Web, every triple is treated as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

In this thesis, the student should develop an approach to check the veracity of a complete Knowledge Base. However, Knowledge Bases can become huge, and checking every single fact would be too time-consuming. Therefore, an approach has to be developed that determines central statements of the Knowledge Base which can be used for checking.
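Which notion of "central" to use is part of the thesis. One deliberately simple possibility, shown below only as an assumption of this sketch, is to rank statements by the PageRank of the entities they connect in the entity graph.

```python
import networkx as nx

triples = [
    (":berlin", ":capitalOf", ":germany"),
    (":germany", ":memberOf", ":eu"),
    (":paderborn", ":locatedIn", ":germany"),
]

# Build an entity graph and score entities with PageRank.
graph = nx.DiGraph((s, o) for s, _, o in triples)
scores = nx.pagerank(graph)

# Rank statements by the centrality of the entities they connect.
ranked = sorted(triples, key=lambda t: scores[t[0]] + scores[t[2]], reverse=True)
print(ranked[0])
```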

Supervisor: Michael Röder

Argument Mining on Knowledge Bases (Master)

In the Semantic Web, every triple is treated as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

Typically, a Fact Checking system tries to search for evidence that shows whether a fact is true or false. Unfortunately, it is not always possible to find such evidence. For these situations, different approaches are necessary. In this thesis, the student should develop an approach that uses the facts in a given Knowledge Base as arguments to prove or refute a given fact.

Supervisor: Michael Röder

Hybrid Fact Checking (Master)

In the Semantic Web, every triple is treated as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

At the moment, there are two distinct approaches for that: either a system searches for textual evidence, or it tries to find evidence in a given knowledge base. In this thesis, the student should combine these two approaches. A possible combination could be to find paths that are expected to be in the knowledge base if the fact is true. If the paths cannot be found because single triples are missing, a textual Fact Checking system could be used to search for evidence for this single missing fact.
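The knowledge-base side of such a combination can be pictured as a path search over the entity graph; the toy triples below are made up, and the textual system would only be consulted for edges that the path search cannot find.

```python
import networkx as nx

triples = [
    (":albert_einstein", ":bornIn", ":ulm"),
    (":ulm", ":locatedIn", ":germany"),
]

graph = nx.DiGraph()
for s, p, o in triples:
    graph.add_edge(s, o, predicate=p)

def supporting_path(subject: str, obj: str):
    """Return a path connecting subject and object if the knowledge base
    contains one, otherwise None; a missing edge on the expected path is where
    a textual fact-checking system could be queried for the missing triple."""
    try:
        return nx.shortest_path(graph, subject, obj)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return None

print(supporting_path(":albert_einstein", ":germany"))
```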

Supervisor: Michael Röder

Negative Fact Checking (Master)

In the Semantic Web, every triple is treated as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

We developed FactCheck, an approach that searches for evidence of a fact in a given corpus. However, at the moment, FactCheck only takes evidence into account that proves the given fact. In this thesis, the student should extend FactCheck to enable it to find refuting textual evidence in a given text corpus.

Supervisor: Michael Röder

Applicability of SQL/SPARQL Query Optimizations to Tensor Algebra (Bachelor / Master)

Supervisor: Prof. Axel Ngonga
Contact: Alexander Bigerl

Task

Today, tensors are used in many computationally demanding areas, such as machine learning and deep learning, quantum physics and chemistry, or in the tensor-based triple store Tentris developed by the Data Science working group. Sparse tensors, i.e. those for which most entries are zero, play a special role.

In this work, the student first obtains an overview of well-established methods for query optimization in SQL and SPARQL and then analyses their applicability to a (simplified) tensor algebra. Promising optimization approaches are also to be implemented and evaluated for Tentris.

In short: What is a tensor?

Well, a simple number, also called a scalar, is a 0-dimensional object. A vector is 1-dimensional, and a matrix has 2 dimensions. If you now imagine a matrix that has a depth, like a cube, then you have a "tensor of rank 3", i.e. one with three dimensions. Tensors can therefore be seen as a generalization of vectors and matrices to any number of dimensions.
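The connection to SPARQL can be sketched in a few lines: an RDF graph becomes a Boolean rank-3 tensor, and a triple pattern becomes a slice of that tensor. The dense array below is used only because the example is tiny; Tentris and comparable systems work on sparse representations.

```python
import numpy as np

# Toy illustration: an RDF graph as a Boolean rank-3 tensor with
# T[s, p, o] = True iff the triple (s, p, o) is in the graph.
entities = {":alice": 0, ":bob": 1}
predicates = {":knows": 0}

tensor = np.zeros((2, 1, 2), dtype=bool)
tensor[entities[":alice"], predicates[":knows"], entities[":bob"]] = True

# The triple pattern (?s, :knows, ?o) becomes a slice along the predicate mode.
print(np.argwhere(tensor[:, predicates[":knows"], :]))  # [[0 1]] -> (:alice, :bob)
```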

Processing Order Metrics for Einstein Summation Conventions to Sparse Tensors (Bachelor)

Supervisor: Prof. Axel Ngonga
Contact: Alexander Bigerl

Task

Today, tensors are used in many computationally demanding areas, such as machine learning and deep learning, quantum physics and chemistry, or in the tensor-based triple store Tentris developed by the Data Science working group. Sparse tensors, i.e. those for which most entries are zero, play a special role.

In this thesis, the student first gains an overview of established metrics for join ordering in database systems. These metrics are then examined for their transferability to the evaluation of the Einstein summation convention on sparse tensors. Promising approaches will also be implemented and evaluated for Tentris.

In short: What is a tensor?

Well, a simple number, also called a scalar, is a 0-dimensional object. A vector is 1-dimensional, and a matrix has 2 dimensions. If you now imagine a matrix that has a depth, like a cube, then you have a "tensor of rank 3", i.e. one with three dimensions. Tensors can therefore be seen as a generalization of vectors and matrices to any number of dimensions.

... and Einstein summation convention?

Jump in and try it. This blog post shows how to use the Einstein summation convention (einsum) to simplify complicated matrix operations: http://ajcr.net/Basic-guide-to-einsum/
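As a taste of why the processing order matters here: the join of two triple patterns can be written as a single einsum over two predicate slices of the RDF tensor, and evaluating such expressions on sparse operands in different orders is exactly what the metrics should predict. The adjacency matrices below are made up.

```python
import numpy as np

# Adjacency matrices of two predicate slices of a toy RDF tensor:
# knows[s, x] and works_at[x, o].
knows = np.array([[0, 1], [0, 0]], dtype=int)
works_at = np.array([[0, 0], [1, 0]], dtype=int)

# SPARQL-like join  ?s :knows ?x . ?x :worksAt ?o  as an Einstein summation:
# result[s, o] = sum_x knows[s, x] * works_at[x, o]
result = np.einsum("sx,xo->so", knows, works_at)
print(np.argwhere(result))  # [[0 0]]: subject 0 reaches object 0 via x = 1
```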

Bulk loading facts into the fast triple store Tentris (Master)

Supervisor: Prof. Axel Ngonga
Contact: Alexander Bigerl

Task

Triple stores are used to serve answers (in the form of result sets) to SPARQL queries. The newly developed Tentris triple store is currently one of the fastest triple stores available. The aim of this thesis is to devise an efficient approach for loading and inserting data into the triple store (bulk loading). As memory is always limited, approaches that further compress the stored data while preserving random accessibility are to be favored. The thesis includes a theoretical discussion of possible approaches, an implementation of the two most promising ones, and an evaluation of their performance w.r.t. the scalability of the insertion of facts (in facts/second).
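A classic first ingredient of bulk loading, shown below only as a simplified sketch and not as Tentris' actual loader, is dictionary encoding followed by sorting the batch, so that indexes can be built by sequential appends instead of random inserts.

```python
class Dictionary:
    """Maps RDF terms to integer IDs; integer triples are far cheaper to sort,
    index and compress than the original strings."""
    def __init__(self):
        self.term_to_id: dict[str, int] = {}
        self.id_to_term: list[str] = []

    def encode(self, term: str) -> int:
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

def bulk_load(triples):
    """Encode and sort a batch of triples before insertion."""
    d = Dictionary()
    encoded = [(d.encode(s), d.encode(p), d.encode(o)) for s, p, o in triples]
    encoded.sort()
    return d, encoded

d, batch = bulk_load([(":b", ":p", ":c"), (":a", ":p", ":b")])
print(batch)  # [(0, 1, 2), (3, 1, 0)]
```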

Link Discovery over geo RDF data: content similarity approach (Bachelor)

Supervisor: Abdullah Ahmed
Contact: Abdullah Ahmed

Geo RDF data is a very important part of Linked Data. Linking datasets that contain geo data is a very interesting topic in academia and industry. Accordingly, many approaches have been introduced to address the problem of link discovery over geo data, taking into account scalability and accuracy as two central factors when implementing such a framework. In this work, we plan to implement Link Discovery (LD) over geo RDF datasets and then compare it with the current state of the art (e.g. RADON, Sherif et al.).

The work will be as follows:
1. Literature review on LD over geo RDF and topological relations such as DE-9IM
2. Implementing the content measure in Java based on the paper by Godoy et al. (a small sketch of the relations involved follows after the references below)
3. Evaluating the approach on real datasets such as NUTS
4. Comparing the results with the RADON algorithms in terms of scalability (runtime) and accuracy (F-measure)
5. Publishing the results at a scientific conference in case of promising results

Requirements:
1. Java programming (good practical experience)
2. Math knowledge (calculus)
3. RDF, semantics, topological geometric relations

References:
1. Godoy et al.: Defining and Comparing Content Measures of Topological Relations
2. Sherif et al.: RADON – Rapid Discovery of Topological Relations
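For orientation, the DE-9IM-style relations that geo link discovery frameworks typically materialise as links can be computed with an off-the-shelf geometry library; the choice of shapely and the toy polygons below are assumptions of this sketch, while the thesis itself implements the content measure of Godoy et al. in Java.

```python
from shapely.geometry import Polygon

# Two toy regions; real input would be the geometries of, e.g., NUTS areas.
a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
b = Polygon([(1, 1), (2, 1), (2, 2), (1, 2)])

# Topological relations from the DE-9IM family.
print(a.contains(b))    # True
print(a.intersects(b))  # True
print(a.touches(b))     # False
print(a.relate(b))      # the raw DE-9IM intersection matrix string
```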

Sentence simplification (Bachelor)

Supervisor: René Speck
Contact: René Speck

The goal of this thesis is the simplification of linguistically complex sentences. The task of extracting simple sentences from a complex input sentence is essentially the task of generating a particular subset of the possible sentences that a reader would assume to be true after reading the input.

Michael Heilman and Noah A. Smith published algorithms to simplify sentences in "Extracting Simplified Statements for Factual Question Generation", Proceedings of QG2010: The Third Workshop on Question Generation, 2010. The thesis includes the implementation of the algorithms of Heilman and Smith as well as the evaluation of their performance on at least one dataset.

Several already implemented approaches for knowledge extraction, for instance FOX (github.com/dice-group/FOX) and Ocelot (github.com/dice-group/Ocelot), could help to fulfill the goal of this thesis.
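For the evaluation part, a very simple harness compares the extracted simple statements of one input sentence against gold simplifications; exact matching as used below is an assumption of this sketch, and real evaluations usually relax it (token overlap or manual judgements).

```python
def statement_overlap(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Exact-match precision/recall/F1 between extracted and gold statements."""
    def normalise(s: str) -> str:
        return " ".join(s.lower().split())
    p = {normalise(s) for s in predicted}
    g = {normalise(s) for s in gold}
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(statement_overlap(
    {"John studied in Paris."},
    {"John studied in Paris.", "John was born in 1931."}))
```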

 


Social Bot for Open Metadata (Bachelor)

Supervisor: Adrian Wilke
Contact: Adrian Wilke


Open Data is increasingly made available by government authorities and local authorities. The extent to which datasets are used depends on their findability and the corresponding metadata. In the Open Data Portal Germany (OPAL) [1], metadata is semantically annotated (RDF/Semantic Web/Linked Data) and enriched. The resulting data can then be used further by humans and machines.

In this Bachelor thesis, a Social Bot is to be developed. It will provide information about suitable datasets in social media when respective questions are asked. The implementation of the bot will be modular, so that the core module is independent of a concrete network. In addition, a suitable bot module should be developed for at least one network, e.g. Twitter or Slack. Current Question Answering technologies such as HAWK [2] will be applied and adapted to the concrete requirements of the data basis and the queries.

 

[1] projekt-opal.de

[2] github.com/dice-group/hawk
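The modular architecture can be sketched as a network-agnostic core behind per-network adapters; the class and function names below are hypothetical and only illustrate the intended separation of concerns.

```python
from abc import ABC, abstractmethod

class ChannelAdapter(ABC):
    """Network-specific part of the bot; one subclass per social network so the
    core question-answering module stays independent of Twitter, Slack etc."""

    @abstractmethod
    def listen(self) -> str:
        """Return the next incoming question."""

    @abstractmethod
    def reply(self, message: str) -> None:
        """Post an answer back to the network."""

class ConsoleAdapter(ChannelAdapter):
    """Stand-in adapter used for local testing."""
    def listen(self) -> str:
        return input("question> ")

    def reply(self, message: str) -> None:
        print(message)

def answer_question(question: str) -> str:
    # Placeholder: here the core module would query the OPAL metadata or
    # delegate to a QA system such as HAWK.
    return f"Sorry, no dataset found for: {question}"

def run_bot(adapter: ChannelAdapter) -> None:
    while True:
        adapter.reply(answer_question(adapter.listen()))

# run_bot(ConsoleAdapter())
```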

Integration and Lifting of Question Answering Datasets (Bachelor)

Supervisor: Daniel Vollmers
Contact: Daniel Vollmers

Currently, there are more than 30 question answering datasets from over 20 years of research. All these datasets come in different formats and forms, and their question-answer pairs can only be answered on specific underlying datasets.

In this thesis, the student will analyse the features of all these datasets and propose a solution to lift these benchmarks to 5-star data (http://5stardata.info/en/) and make them accessible. Answers will be grounded in knowledge bases via machine learning methods. Finally, the lifted datasets will be integrated into the renowned GERBIL QA framework (http://gerbil-qa.aksw.org/gerbil/).

Source Code: https://github.com/dice-group/NLIWOD/tree/master/qa.datasets
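A lifting step usually starts from a common intermediate representation into which the heterogeneous formats are parsed before grounding; the field names and the toy legacy format below are assumptions of this sketch, not an existing standard.

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    """Minimal common representation a heterogeneous benchmark could be lifted
    into before answers are grounded in a knowledge base."""
    question: str
    answers: list[str]
    language: str = "en"
    source_dataset: str = ""
    grounded_resources: list[str] = field(default_factory=list)  # e.g. KB URIs

def from_simple_tsv(line: str, dataset: str) -> QAPair:
    """Parse one 'question<TAB>answer' line of a hypothetical legacy format."""
    question, answer = line.rstrip("\n").split("\t", 1)
    return QAPair(question=question, answers=[answer], source_dataset=dataset)

print(from_simple_tsv("Who developed BadWolf?\tGoogle", "toy-dataset"))
```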

Disclaimer

For most theses, the required skills include good knowledge of the Java, C++ or Python programming languages and the willingness to delve into exciting research. Students will be given the opportunity to have an impact on the whole Semantic Web and Data Science community. Furthermore, we offer close supervision during the writing of your thesis.

The general structure of writing a thesis in the DICE group is described at https://dice-group.github.io/theses/. The development will be carried out using Git in a Scrum-like setting. If you do not find a topic that fits your interests, you can also have a look at our GitHub repository (https://github.com/dice-group) to get some additional impressions and send us your ideas!
