Available Thesis Topics

BQL (Google graph store) to SPARQL bridge (Master)

Supervisor: Prof. Axel Ngonga
Contact: Prof. Axel Ngonga

Google's BadWolf (https://github.com/google/badwolf) implements BQL, a query language for a graph model loosely modelled around RDF. But how loosely? The core of this thesis is to create a converter from BQL to SPARQL and vice versa, which will allow benchmarking the performance of BadWolf against that of existing triple stores. The main tasks will hence be to study the calculus behind BQL, map it to the SPARQL calculus, implement the mapping, and run benchmarks derived from real query logs on BadWolf and three selected triple stores (Virtuoso, GraphDB and Tentris).
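
To make the mapping concrete, here is a minimal sketch of the BQL-to-SPARQL direction for a single triple pattern. The node syntax (/type<id>) and the immutable predicate syntax ("name"@[]) follow the BadWolf README; the IRI scheme used in the translation is a made-up assumption, and a real converter would of course start from the full BQL grammar rather than regular expressions:

```python
import re

NODE = re.compile(r"/(\w+)<([^>]+)>")           # e.g. /u<joe>
IMMUTABLE_PRED = re.compile(r'"([^"]+)"@\[\]')  # e.g. "parent_of"@[]

def term_to_sparql(term: str, base: str = "http://example.org/") -> str:
    """Map a single BQL term to a SPARQL term (hypothetical IRI scheme)."""
    if term.startswith("?"):
        return term
    m = NODE.fullmatch(term)
    if m:
        return f"<{base}{m.group(1)}/{m.group(2)}>"
    m = IMMUTABLE_PRED.fullmatch(term)
    if m:
        return f"<{base}p/{m.group(1)}>"
    raise ValueError(f"unsupported BQL term: {term}")

def bql_triple_to_sparql(s: str, p: str, o: str) -> str:
    return " ".join(term_to_sparql(t) for t in (s, p, o)) + " ."

# BQL: SELECT ?child FROM ?family WHERE { /u<joe> "parent_of"@[] ?child };
print("SELECT ?child WHERE { "
      + bql_triple_to_sparql("/u<joe>", '"parent_of"@[]', "?child") + " }")
```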

Benchmarking BadWolf using IGUANA 2.0 (Bachelor)

Supervisor: Felix Conrads
Contact: Prof. Axel Ngonga

Triple stores answer SPARQL queries and play a central role in the management of RDF data, as they are the backbone of a plethora of Linked-Data-driven applications. However, it is often unclear how triple stores perform for particular use cases. Reference benchmarking frameworks such as IGUANA make it possible to benchmark triple stores and achieve comparable results. With BadWolf, Google developed a store able to deal with queries in BQL, a language similar to SPARQL. The goal of this thesis is to develop a benchmark that allows evaluating the performance of BadWolf on loads which can be expressed in an equivalent way in both SPARQL and BQL. The student is expected to (1) develop the benchmark, (2) execute the benchmark using IGUANA, and (3) report their findings on the pros and cons of BadWolf.


Data: http://iguana-benchmark.eu/, https://github.com/google/badwolf  
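
The actual benchmarking is to be done with IGUANA; the following sketch only illustrates the core measurement behind such benchmarks, namely queries per second for a query mix against a SPARQL HTTP endpoint (endpoint URL and query mix are placeholders):

```python
import time
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:3030/ds/sparql"          # placeholder endpoint
QUERIES = ["SELECT * WHERE { ?s ?p ?o } LIMIT 100"]   # placeholder query mix

def run_query(endpoint: str, query: str) -> bytes:
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req, timeout=180) as resp:
        return resp.read()

start = time.perf_counter()
for q in QUERIES:
    run_query(ENDPOINT, q)
elapsed = time.perf_counter() - start
print(f"QpS: {len(QUERIES) / elapsed:.2f}")
```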

Development of a benchmark for dataset linkage recommendation systems (Master)

A central aim of Linked Data is to interlink datasets. Therefore, a newly created dataset should be linked to existing datasets. Since no user can have an overview of thousands of existing datasets, search engines like Tapioca exist to retrieve datasets that might be link candidates for a given dataset.

The goal of this thesis is the development of a benchmark for such dataset similarity approaches. The benchmark should measure how well a dataset linkage recommendation system ranks the available datasets for a given input dataset (comparable to a "normal" information retrieval search engine).

To be able to measure the quality of the ranking, the student will use either a classification or a fact checking task. The task is mainly based on the RDF dataset which is used as a query for the dataset linkage recommendation system. The assumption is that a good recommendation system should provide datasets that increase the classifier's (or fact checker's) performance on its task. This increase in performance (Delta@n) is used as the performance indicator of the recommendation system.
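
A minimal sketch of the Delta@n idea under the assumptions above, where train_and_score stands in for the classification (or fact checking) task:

```python
# Train and evaluate once without and once with the top-n recommended
# datasets; the difference is the quality signal for the recommender.
def delta_at_n(query_dataset, ranked_datasets, n, train_and_score):
    baseline = train_and_score([query_dataset])
    enriched = train_and_score([query_dataset] + ranked_datasets[:n])
    return enriched - baseline  # > 0 means the recommendations helped
```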

As a stretch goal, the benchmark should be executable on the HOBBIT platform.

Supervisor: Michael Röder


Usage of X2vec for dataset search (Master)

A central aim of Linked Data is to interlink datasets. Therefore, a newly created dataset should be linked to existing datasets. Since no user can have an overview of thousands of existing datasets, search engines like Tapioca exist to retrieve datasets that might be link candidates for a given dataset.

 

For this thesis, the student should get an overview of the different X2vec methods that exist to transform a given RDF dataset into a vector representation. This representation should be used for calculating similarity values to other, indexed datasets, as done with Tapioca.
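
Once an X2vec method has produced one vector per dataset, retrieval reduces to nearest-neighbour search over the index. A minimal sketch with random placeholder vectors:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
index = {f"dataset_{i}": rng.normal(size=128) for i in range(1000)}
query_vec = rng.normal(size=128)  # embedding of the new dataset

ranking = sorted(index, key=lambda d: cosine(query_vec, index[d]),
                 reverse=True)
print(ranking[:10])  # top-10 link candidates
```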

Supervisor: Michael Röder

Optimization techniques for federated SPARQL query processing

Supervisor: Dr. Muhammad Saleem
Contact: Dr. Muhammad Saleem

This thesis will explore the different optimization techniques used in distributed SPARQL query processing, in particular source selection, indexing, join ordering and query planning, and the different join implementations.
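
As one example of the techniques to be surveyed, the following sketch shows triple-pattern-wise source selection via ASK queries, as used e.g. by FedX (endpoint URLs are placeholders):

```python
# Send an ASK query per triple pattern to every endpoint and keep only
# the endpoints that answer true.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = ["http://localhost:8890/sparql",
             "http://localhost:3030/ds/sparql"]

def relevant_sources(triple_pattern: str) -> list:
    sources = []
    for endpoint in ENDPOINTS:
        client = SPARQLWrapper(endpoint)
        client.setQuery(f"ASK WHERE {{ {triple_pattern} }}")
        client.setReturnFormat(JSON)
        if client.query().convert()["boolean"]:
            sources.append(endpoint)
    return sources

print(relevant_sources("?drug <http://www.w3.org/2002/07/owl#sameAs> ?x"))
```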

Optimization techniques in triple stores for SPARQL query processing

Supervisor: Dr. Muhammad Saleem
Contact: Dr. Muhammad Saleem

This thesis will explore the different optimization techniques used in state-of-the-art triple stores, including data representation and storage, indexing, join ordering and query planning, and the different join implementations.
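
As one example of the storage techniques to be surveyed, the following sketch shows redundant index permutations (SPO, POS, OSP), which let a store answer any triple pattern with one or two constants by a direct lookup:

```python
from collections import defaultdict

class TripleIndexes:
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):   # pattern (s, p, ?o) -> SPO index
        return self.spo[s][p]

    def subjects(self, p, o):  # pattern (?s, p, o) -> POS index
        return self.pos[p][o]

idx = TripleIndexes()
idx.add(":alice", ":knows", ":bob")
print(idx.subjects(":knows", ":bob"))  # {':alice'}
```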

Analysis of the relative errors in cardinality-based SPARQL federation engines

Supervisor: Dr. Muhammad Saleem
Contact: Dr. Muhammad Saleem

This thesis will investigate how good the query plan generated by the underlying cost-based distributed SPARQL engine is in terms of the relative error. The relative error is a performance measure that tells how accurate the estimated result size of triple patterns, or of joins between triple patterns, is. More accurate estimations lead to better query execution times.
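
A minimal sketch of the measure, following a common definition from the cardinality estimation literature (the exact definition to be used is part of the thesis):

```python
def relative_error(estimated: int, actual: int) -> float:
    return abs(estimated - actual) / max(actual, 1)

# An engine that estimates 80 results for a join that actually returns
# 100 results has a relative error of 0.2:
print(relative_error(80, 100))  # 0.2
```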

Extend Bengal towards a new language (Master)

Supervisor: Diego Moussallem
Contact: Michael Röder, Diego Moussallem

Extracting structured information which can be described in RDF from unstructured data like natural language text is one of the major fields of the Semantic Web and Natural Language Processing communities. There are several approaches to tackle this problem and several benchmarks for evaluating the performance of these approaches (see http://gerbil.aksw.org/).

The main disadvantage of these benchmarks is their size. Since an information extraction benchmark dataset has to be created manually, its creation is expensive and its size is limited. This makes it nearly impossible to benchmark the different approaches in the area of Big Linked Data.
To this end, the DICE group developed an approach to automatically generate natural language documents based on a given knowledge base (https://github.com/dice-group/Bengal). These documents can be used for benchmarking Named Entity Recognition and Linking approaches regarding their efficiency when it comes to handling large amounts of data.
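
A heavily simplified sketch of the verbalization idea: triples about a subject are turned into sentences via per-predicate templates. Bengal's actual pipeline is more sophisticated, but extending it to a new language means, among other things, providing such surface realizations for that language:

```python
# Hypothetical per-predicate templates for English; a new language needs
# its own set (plus handling of morphology, ordering, etc.).
TEMPLATES_EN = {
    "birthPlace": "{s} was born in {o}.",
    "author": "{o} wrote {s}.",
}

def verbalize(s: str, p: str, o: str, templates: dict) -> str:
    return templates[p].format(s=s, o=o)

print(verbalize("Albert Einstein", "birthPlace", "Ulm", TEMPLATES_EN))
```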

The focus of this master thesis is to extend Bengal to a language other than English or Portuguese and to benchmark at least two approaches for this language. Underrepresented languages are very welcome.

Veracity of Knowledge Bases (Master)

In the Semantic Web, every triple is handled as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

In this thesis, the student should develop an approach to check the veracity of a complete knowledge base. However, knowledge bases can become huge, and checking every single fact would be too time-consuming. Therefore, an approach has to be developed to determine central statements of the knowledge base which can be used for checking.
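
One possible, purely hypothetical notion of centrality is sketched below: rank facts by the degree of the entities they connect and check only the top-k facts. Finding a suitable measure is the actual core of the thesis:

```python
from collections import Counter

def central_facts(triples, k=1000):
    """Return the k facts connecting the highest-degree entities."""
    degree = Counter()
    for s, p, o in triples:
        degree[s] += 1
        degree[o] += 1
    return sorted(triples,
                  key=lambda t: degree[t[0]] + degree[t[2]],
                  reverse=True)[:k]
```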

Supervisors: Zafar Syed, Michael Röder


Hybrid Fact Checking (Master)

In the Semantic Web, every triple is handled as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

At the moment, there are two distinct approaches for that: either a system searches for textual evidence, or it tries to find evidence in a given knowledge base. In this thesis, the student should combine these two approaches. A possible combination could be to find paths that are expected to be in the knowledge base if the fact is true. If the paths cannot be found because single triples are missing, a textual Fact Checking system could be used to search for evidence for such a single, missing fact.
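
A minimal sketch of this combination, assuming the knowledge base is a set of triples and check_in_text stands in for a textual fact checker such as FactCheck:

```python
def hybrid_check(s, o, hop1, hop2, kb, check_in_text):
    """Veracity of a fact given an expected path s -hop1-> m -hop2-> o."""
    middles = {m for (x, p, m) in kb if x == s and p == hop1}
    best = 0.0
    for m in middles:
        if (m, hop2, o) in kb:
            return 1.0  # full path found in the knowledge base
        # second hop missing: ask the textual fact checker for it
        best = max(best, check_in_text(m, hop2, o))
    return best
```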

Supervisors: Zafar Syed, Michael Röder


Negative Fact Checking (Master)

In the Semantic Web, every triple is handled as a true fact. However, in times of fake news, this assumption does not have to hold. Therefore, the area of Fact Checking is developing approaches to check single facts.

We developed FactCheck, an approach that searches for evidence of a fact in a given corpus. However, at the moment, FactCheck only takes evidence into account that supports the given fact. In this thesis, the student should extend FactCheck to enable it to find refuting textual evidence in a given text corpus.

Supervisors: Zafar Syed, Michael Röder


Applicability of SQL/SPARQL Query Optimizations to Tensor Algebra (Bachelor / Master)

Task

Today, tensors are used in many computationally demanding areas, such as machine learning and deep learning, quantum physics and chemistry, or the tensor-based triple store (Tentris) developed by the Data Science working group. Sparse tensors, i.e. those for which most entries are zero, play a special role.

In this work, the student first obtains an overview of well-established methods for query optimization in SQL and SPARQL and then analyses their applicability to a (simplified) tensor algebra.

(Master) Promising optimization approaches are also to be implemented and evaluated for Tentris.

In short: What is a tensor?

Well, a simple number, also called a scalar, is a 0-dimensional object. A vector is obviously 1-dimensional, and a matrix has 2 dimensions. If you now imagine a matrix that has a depth, like a cube, then you have a "tensor of rank 3", i.e. one with three dimensions. Tensors can therefore be seen as a generalization of vectors and matrices to any number of dimensions.
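
In NumPy, this hierarchy looks as follows. The rank-3 example also shows the idea behind Tentris, where an RDF graph over n terms is viewed as a sparse Boolean n x n x n tensor T with T[s, p, o] = 1 iff the triple (s, p, o) is in the graph (toy integer ids here):

```python
import numpy as np

scalar = np.float64(3.0)            # rank 0
vector = np.zeros(4)                # rank 1
matrix = np.zeros((4, 4))           # rank 2
tensor = np.zeros((4, 4, 4), bool)  # rank 3

tensor[0, 1, 2] = True  # triple (s=0, p=1, o=2) is in the graph
print(tensor.ndim, tensor[0, 1, 2])  # 3 True
```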

Supervisor: Alexander Bigerl

Metaprogramming Tensor Data Structures (Master)

Supervisor: Prof. Axel Ngonga
Contact: Alexander Bigerl

Task

Flexible data structures like tensors require flexible programming techniques like templates and metaprogramming. For our tensor-based RDF triple store we developed a highly flexible data structure based on advanced C++ metaprogramming.

The student will add further features to an existing implementation and evaluate the change with respect to speed, compression and scalability. The thesis includes a theoretical discussion of the data structure and the optimization approaches.
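
For illustration only, here is a conceptual Python analogue of such a flexible sparse tensor structure (the real implementation relies on C++ template metaprogramming): a trie of depth 3 over integer ids that supports slicing by the first dimension:

```python
from collections import defaultdict

def trie():
    return defaultdict(lambda: defaultdict(set))

T = trie()               # T[s][p] -> set of o
T[0][1].add(2)           # store entry (0, 1, 2)

slice_s0 = T[0]          # fixing s=0 yields a rank-2 structure over (p, o)
print(2 in slice_s0[1])  # True
```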

Link Discovery Over Geo RDF Knowledge Graph: Content Similarity Approach (Bachelor)

Supervisor: Abdullah Ahmed
Contact: Abdullah Ahmed

Geo RDF knowledge graphs are a very important part of Linked Data. Linking datasets that contain geospatial data is a very interesting topic in academia and industry. Accordingly, many approaches have been introduced to address the problem of link discovery over such data, taking into account scalability and accuracy as the two central factors when such a framework is implemented. In this work, we plan to implement Link Discovery (LD) over geo RDF datasets and then compare it with the current state of the art (e.g. RADON, Sherif et al.).
The work will be as follows:
1. Literature review on LD over geo RDF and topological relations such as DE-9IM (a minimal example is sketched below)
2. Implementing the content measure in Java based on the paper by Godoy et al.
3. Evaluating the approach on real datasets such as NUTS
4. Comparing the results with the RADON algorithm in terms of scalability (runtime) and accuracy (F-measure)
5. Publishing the results at a scientific conference in case of promising results
Requirements:
1. Java programming (good practical experience)
2. Math knowledge (algebra)
3. RDF, Semantic Web, topological geometric relations
References:
1. Godoy et al.: Defining and Comparing Content Measures of Topological Relations
2. Sherif et al.: RADON: Rapid Discovery of Topological Relations
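
For illustration, the DE-9IM relation between two geometries can be computed e.g. with the Python library shapely (the thesis implementation itself is to be done in Java):

```python
from shapely.geometry import Polygon

a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
b = Polygon([(2, 2), (6, 2), (6, 6), (2, 6)])

print(a.relate(b))      # DE-9IM matrix as a string, e.g. '212101212'
print(a.intersects(b))  # True
print(a.touches(b))     # False
```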

Sentence simplification (Bachelor)

Supervisor: René Speck
Contact: René Speck

The goal of this thesis is the simplification of linguistically complex sentences. The task of extracting simple sentences from a complex input sentence is essentially the task of generating a particular subset of the possible sentences that a reader would assume to be true after reading the input.

Michael Heilman and Noah A. Smith published algorithms to simplify sentences in "Extracting simplified statements for factual question generation" (Proceedings of QG2010: The Third Workshop on Question Generation, 2010). The thesis includes the implementation of the algorithms of Heilman and Smith as well as the evaluation of their performance on at least one dataset.

Several already implemented approaches for knowledge extraction, for instance, FOX (github.com/dice-group/FOX) and Ocelot (github.com/dice-group/Ocelot) could help to fulfill the goal of this thesis.
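
A minimal sketch of one simplification step, splitting clauses joined by a coordinating conjunction, using a spaCy dependency parse. Heilman and Smith's algorithms operate on phrase-structure parses and cover far more phenomena; this only illustrates the general idea:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def split_conjoined_clauses(sentence):
    doc = nlp(sentence)
    root = [t for t in doc if t.dep_ == "ROOT"][0]
    conj_verbs = [t for t in doc if t.dep_ == "conj" and t.head == root]
    if not conj_verbs:
        return [sentence]
    conj_tokens = {t.i for v in conj_verbs for t in v.subtree}
    # First clause: everything except the conjoined clauses, the
    # conjunction and punctuation.
    first = [t.text for t in doc
             if t.i not in conj_tokens and t.dep_ not in ("cc", "punct")]
    clauses = [" ".join(first) + "."]
    # Remaining clauses: reuse the subject of the root (a simplification).
    subject = []
    for child in root.children:
        if child.dep_ in ("nsubj", "nsubjpass"):
            subject = [t.text for t in child.subtree]
    for v in conj_verbs:
        body = [t.text for t in sorted(v.subtree, key=lambda t: t.i)
                if t.dep_ != "punct"]
        clauses.append(" ".join(subject + body) + ".")
    return clauses

print(split_conjoined_clauses(
    "The senator arrived in Paris and met the delegation."))
# ['The senator arrived in Paris.', 'The senator met the delegation.']
```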

Social Bot for Open Metadata (Bachelor)

Open Data is increasingly being made available by government agencies and municipalities. How often the datasets are used depends largely on their findability and the corresponding metadata. In the Open Data Portal Germany (OPAL), metadata is semantically annotated (RDF/Semantic Web/Linked Data) and enriched. The data is described using the Data Catalog Vocabulary (DCAT). The resulting data can then be reused by humans and machines. Currently, around 800,000 datasets are available in OPAL.

In this Bachelor thesis, a social bot is to be developed. It will provide information about suitable datasets in social media when respective questions are asked. The implementation of the bot is to be modular, so that the core module is independent of a concrete network. In addition, a suitable bot module should be developed for at least one network, e.g. Twitter or Slack. Current Question Answering technologies such as HAWK, the DBpedia chatbot, or TeBaQA will be applied and adapted to the concrete requirements of the data basis and queries.

Requirements: Java
Focus of the work: implementation, research and comparison of existing QA approaches, RDF, possibly SPARQL
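
A minimal sketch of the modular architecture, with all names being illustrative: the core answers questions and knows nothing about any concrete social network, while per-network modules (Twitter, Slack, ...) only translate between their platform and the core:

```python
from abc import ABC, abstractmethod

class BotCore:
    """Network-independent core; a real one would call a QA component
    (e.g. based on HAWK or TeBaQA) over the DCAT metadata."""
    def answer(self, question: str) -> str:
        return "Candidate dataset: https://example.org/dataset/1"

class NetworkModule(ABC):
    """One subclass per social network (Twitter, Slack, ...)."""
    def __init__(self, core: BotCore):
        self.core = core

    @abstractmethod
    def listen(self) -> None:
        ...

class ConsoleModule(NetworkModule):
    """Stands in for a concrete network module."""
    def listen(self) -> None:
        # A real module would poll the network's API for questions here.
        print(self.core.answer("Which datasets cover air quality?"))

ConsoleModule(BotCore()).listen()
```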

 

Contact: Adrian Wilke

Coherence measure evaluation (Bachelor)

For evaluating the quality of word sets, coherence measures can be used. There are several coherence measures available, and an empirical evaluation of more than 200k measures has been carried out using the Palmetto project. The details are described in "Exploring the Space of Topic Coherence Measures".

The focus of this thesis is threefold:

  1. Enhance the Palmetto software to increase the speed of the coherence calculation when using more than one coherence measure.
  2. Repeat the evaluation on a new Wikipedia dump (used as the reference corpus).
  3. Add confirmation measures that have not been taken into account so far, e.g., PPMI, PMI², PMI³, etc. (see the sketch below).
    (This third point can be extended to increase the impact of the thesis.)
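
A minimal sketch of a PMI-based confirmation measure over word probabilities estimated from a reference corpus; Palmetto composes such measures from segmentation, probability estimation, confirmation and aggregation (the probabilities below are placeholders):

```python
import math
from itertools import combinations

def pmi(w1, w2, p_word, p_pair, eps=1e-12):
    return math.log((p_pair[(w1, w2)] + eps) / (p_word[w1] * p_word[w2]))

def coherence(words, p_word, p_pair):
    pairs = list(combinations(sorted(words), 2))
    return sum(pmi(a, b, p_word, p_pair) for a, b in pairs) / len(pairs)

p_word = {"apple": 0.01, "fruit": 0.02, "banana": 0.008}
p_pair = {("apple", "banana"): 0.002, ("apple", "fruit"): 0.004,
          ("banana", "fruit"): 0.003}
print(coherence(["apple", "fruit", "banana"], p_word, p_pair))
```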

Supervisor: Michael Röder


Color the LOD cloud (Master)

The Linked Open Data (LOD) cloud is a set of interlinked RDF datasets that are publicly available. From time to time, a figure of the cloud is published, in which the datasets are colored based on their content. However, the categories as well as the assignment of the single datasets to the single categories are created manually by humans who read descriptions of the datasets or take a look into the data. The goal of this thesis is to create an approach that does this in an automatic, unsupervised way to give humans easier access to the LOD cloud. This comprises several steps:

  1. Extract data (or use metadata) from the datasets
  2. Calculate a topic model (e.g., by using Latent Dirichlet Allocation) and assign topic probabilities to the single datasets (see the sketch below)
  3. Filter the topics based on their quality (e.g., calculated by Palmetto)
  4. Generate short labels for the single topics
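
A sketch of step 2 with scikit-learn, using one placeholder "document" per dataset (e.g. its labels and literals):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

dataset_docs = {                  # placeholder textual representations
    "dbpedia-sample": "city country person film music award",
    "geo-sample": "latitude longitude place region border",
}

counts = CountVectorizer().fit_transform(dataset_docs.values())
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(counts)  # rows: datasets, cols: topics
print(dict(zip(dataset_docs, topic_dist.round(2))))
```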

Supervisor: Michael Röder

Runtime-Compile Tensor Expressions (Bachelor)

For our tensor-based RDF triple store we developed a highly flexible data structure based on advanced C++ metaprogramming. When a query is run multiple times, its operator tree can be reused. An additional performance gain can be expected when statistics from a previous run are used to compile a hard-coded operator tree. This way, the second run of a query benefits from compiler optimizations of the operator tree.

The student will implement three parts:

  • gather statistics from a run of a greedy just-in-time query planner
  • generate code for a hard-coded operator tree
  • compile and load the hard-coded operator tree and use it

The student will evaluate costs and benefits of compiling with respect to overhead, speedup and scalability. The thesis includes a theoretical discussion of the data structure and query processing.
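
For illustration, here is a Python analogue of the compile-and-reload idea (the thesis itself targets C++): statistics from a first, greedily planned run determine a join order, for which the source code of a hard-coded operator chain is generated, compiled and loaded for the next run:

```python
def generate_operator_source(join_order):
    """Emit source for a hard-coded chain of toy 'join' operators."""
    lines = ["def run(relations):",
             "    result = relations[%d]" % join_order[0]]
    for i in join_order[1:]:
        lines.append(
            "    result = [a + b for a in result for b in relations[%d]]" % i)
    lines.append("    return result")
    return "\n".join(lines)

stats_based_order = [2, 0, 1]  # e.g. smallest relation first
src = generate_operator_source(stats_based_order)
namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)  # "compile and load"
print(namespace["run"]([["a"], ["b"], ["c", "d"]]))   # ['cab', 'dab']
```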

Requirement: solid modern C++11/14/17 skills

Supervisor: Alexander Bigerl

Online Query-Planning for SPARQL Features (1x Bachelor / 2x Master)

For our tensor-based RDF triple store we developed a highly flexible data structure based on advanced C++ metaprogramming. Currently, it allows running basic graph patterns with or without DISTINCT. The student will extend it by adding (a, Master) filters and functions, (b, Master) aggregates, or (c, Bachelor) OPTIONAL.

Theoretical work:

  • The student will define a new operator in the query graph that handles the selected feature. Alternatively, an existing operator may be modified to reach the goal.
  • The student will extend an existing metric for choosing the best operator greedily.

The student will implement three parts:

  • add the selected feature to the query parser and the internal parsed-query representation
  • implement the operator
  • implement the extended metric

The student will evaluate the performance of the implemented feature against other triple stores (e.g. Fuseki, Virtuoso, Blazegraph). The thesis includes a theoretical discussion of the data structure and query processing.
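
For the OPTIONAL case, a minimal sketch of the operator semantics as a nested-loop left join over solution mappings; a real implementation would work on the query graph and use the extended metric to place the operator:

```python
def compatible(m1, m2):
    """Two solution mappings agree on all shared variables."""
    return all(m2.get(k, v) == v for k, v in m1.items())

def optional(left, right):
    for m in left:
        extensions = [dict(m, **r) for r in right if compatible(m, r)]
        yield from extensions or [m]  # keep m unextended if nothing joins

people = [{"s": "alice"}, {"s": "bob"}]
emails = [{"s": "alice", "mail": "a@example.org"}]
print(list(optional(people, emails)))
# [{'s': 'alice', 'mail': 'a@example.org'}, {'s': 'bob'}]
```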

Requirement: solid modern C++11/14/17 skills

Supervisor: Alexander Bigerl

Integration and Lifting of Question Answering Datasets (Bachelor)

Supervisor: Daniel Vollmers
Contact: Daniel Vollmers

Currently, there are more than 30 question answering datasets from over 20 years of research. All these datasets come in different formats and forms, and their question-answer pairs can only be answered on specific underlying datasets.

In this thesis, the student will analyse the features of all these datasets and propose a solution to lift these benchmarks to 5-star data (http://5stardata.info/en/) and make them accessible. Answers will be grounded in knowledge bases via machine learning methods. Finally, the lifted datasets will be integrated into the renowned framework GERBIL QA (http://gerbil-qa.aksw.org/gerbil/).

Source Code: https://github.com/dice-group/NLIWOD/tree/master/qa.datasets

Disclaimer

For most theses, the required skills include good knowledge of the Java, C++ or Python programming languages and the willingness to delve into exciting research. Students will be provided with the opportunity to impact the whole of the Semantic Web and Data Science community. Furthermore, we will offer close supervision during the writing of your thesis.

The general structure of writing a thesis in the DICE group is described at https://dice-group.github.io/theses/. The development will be carried out using Git in a Scrum-like setting. If you do not find a topic that fits your interests, you can also have a look at our GitHub repository (https://github.com/dice-group) to get some additional impressions and send us your ideas!

If you have participated in our open theses event, you can use this form to contact us.
