The Data Science Suite PG aims at covering the Linked Data lifecycle: from gathering (e.g., crawling) data, through extracting structured information, indexing and storing the structured data, to cleaning and checking the data before presenting it to the user in an interactive way. Since the PG covers several fields, we offer several subtopics that are handled by smaller groups.

Since Data Science Suite II aims in the same direction, the students of these two PGs will work very closely together. This offers the possibility to join a team in which students already have several months of experience.

In the following, we list the subgroups that we plan to offer during the next semester as well as the evaluation criteria for participating students (note that both are not final and might be subject to change).

Crawling

Problem: Data is distributed across the Linked Open Data cloud
Solution: Gather the data from the cloud
Goal: Improve our Linked Data Crawler

To gather data from web resources, a crawler is needed. Such a crawler downloads resources from given URIs, stores the data in a triple store, and extracts new URIs that point to resources which have not been crawled before. Our group has developed a Linked Data crawler named Squirrel. It works in a distributed environment by making use of AMQP message queues and Docker containers. This crawler will be further enhanced by the students.
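
To make this loop concrete, the following is a minimal, single-threaded sketch of such a crawl cycle in Java. It is not Squirrel's actual code: the class and method names are made up, Apache Jena is used here only as a convenient RDF parser, and the distribution over AMQP queues and Docker containers is left out entirely.

```java
import org.apache.jena.rdf.model.*;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal, single-threaded sketch of the crawl loop described above.
 * All names are illustrative only.
 */
public class CrawlLoopSketch {

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>();   // URIs still to be crawled
        Set<String> seen = new HashSet<>();            // URIs already encountered
        frontier.add("http://example.org/seed");       // hypothetical seed URI
        seen.add("http://example.org/seed");

        while (!frontier.isEmpty()) {
            String uri = frontier.poll();

            // 1. Download the resource and parse it as RDF (Apache Jena).
            Model model = ModelFactory.createDefaultModel();
            try {
                model.read(uri);
            } catch (Exception e) {
                continue;                              // skip unreachable or non-RDF resources
            }

            // 2. Store the triples, e.g., by sending them to a triple store.
            store(model);

            // 3. Extract new URIs that point to resources not crawled before.
            StmtIterator it = model.listStatements();
            while (it.hasNext()) {
                Statement stmt = it.next();
                if (stmt.getObject().isURIResource()) {
                    String next = stmt.getObject().asResource().getURI();
                    if (seen.add(next)) {
                        frontier.add(next);
                    }
                }
            }
        }
    }

    private static void store(Model model) {
        // Placeholder: a real crawler would write to a triple store here.
        System.out.println("Storing " + model.size() + " triples");
    }
}
```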

Data Portals

Problem: Datasets are insufficiently described.
Solution: Detect similarities in metadata records, e.g., geospatial data or synonyms in textual descriptions.
Goal: Analyze metadata records, generate tags and relations, and provide filtering based on a given entity.

During an initial crawling, we are gathering metadata records about datasets. In the best case, a record consists of complete and correct data with a clear name and a long description. Unfortunately, this is not the case for all datasets. In one of our research projects, we gathered hundreds of thousands of metadata records which should be further processed and enhanced by extracting entities and relations (see figure below). Additionally, a demo should be created which eases the search in this large amount of datasets, e.g., by filtering based on geospatial entities like countries or cities.
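
As a rough illustration of the intended tag generation and entity-based filtering, the following Java sketch tags dataset records with geospatial entities found in their textual descriptions. It is a toy example: the record structure and the tiny hand-made gazetteer are assumptions, and a real solution would use a proper entity extraction component instead of simple string matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/**
 * Toy sketch: tag metadata records with geospatial entities mentioned in their
 * description and filter them by a given entity.
 */
public class MetadataTaggingSketch {

    record DatasetRecord(String name, String description, List<String> tags) {}

    // Hypothetical gazetteer mapping surface forms to geospatial entities.
    static final Map<String, String> GAZETTEER = Map.of(
            "germany", "Germany",
            "berlin", "Berlin",
            "paris", "Paris");

    static void tag(DatasetRecord record) {
        String text = record.description().toLowerCase(Locale.ROOT);
        GAZETTEER.forEach((surfaceForm, entity) -> {
            if (text.contains(surfaceForm)) {
                record.tags().add(entity);
            }
        });
    }

    static List<DatasetRecord> filterByEntity(List<DatasetRecord> records, String entity) {
        return records.stream().filter(r -> r.tags().contains(entity)).toList();
    }

    public static void main(String[] args) {
        DatasetRecord r = new DatasetRecord(
                "Traffic counts", "Hourly traffic counts for Berlin, Germany.", new ArrayList<>());
        tag(r);
        System.out.println(r.tags());                                    // e.g. [Germany, Berlin]
        System.out.println(filterByEntity(List.of(r), "Berlin").size()); // 1
    }
}
```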

Storage

Problem: SPARQL queries need to be executed quickly
Solution: Use algebraic operations on sparse tensors to process SPARQL queries on RDF.
Goal: Improve average query runtime and memory footprint.
Tasks: We developed a triple store based on tensor algebra, dubbed Tentris. It stores RDF as a three-dimensional tensor and executes SPARQL queries by applying algebraic operations to this tensor (a conceptual sketch of this idea follows below). The students will work on improving the existing Tentris implementation. The tasks range from implementing a visual interface for the underlying tensor calculus to improving the actual tensor data structure and speeding up tensor expression processing.
Prerequisites: The students should bring some skills in writing modern C++ (11/14/17/20) code and building it with CMake. Alternatively or additionally, advanced web development skills for building modern web apps are desirable. Solid knowledge of Linux and Git is a must.
Previous knowledge about tensors is not required.
Good students may extend their work in the project group into a master's thesis.
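
The following Java sketch (Tentris itself is written in C++) illustrates the tensor view described in the tasks above: the RDF graph is seen as a Boolean order-3 tensor with an entry of 1 at position (s, p, o) exactly if the triple (s, p, o) is in the graph, so a triple pattern corresponds to a tensor slice and a join over a shared variable to an element-wise product of slices. The set-based representation and all names are illustrative only and say nothing about the actual Tentris data structures.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Conceptual sketch of the tensor view on RDF: the graph is a Boolean order-3
 * tensor T with T[s][p][o] = 1 iff the triple (s, p, o) exists. Here the
 * sparse tensor is simply the set of its non-zero coordinates.
 */
public class RdfTensorSketch {

    record Triple(int s, int p, int o) {}            // dictionary-encoded term IDs

    private final Set<Triple> nonZeros = new HashSet<>();

    void add(int s, int p, int o) {
        nonZeros.add(new Triple(s, p, o));
    }

    /** Slice T[:, p, o]: all subjects s with T[s][p][o] = 1, i.e. the pattern "?x p o". */
    Set<Integer> subjects(int p, int o) {
        Set<Integer> result = new HashSet<>();
        for (Triple t : nonZeros) {
            if (t.p() == p && t.o() == o) {
                result.add(t.s());
            }
        }
        return result;
    }

    /**
     * The two-pattern join "?x p1 o1 . ?x p2 o2" corresponds to an element-wise
     * product of the two slices, i.e. the intersection of the subject sets.
     */
    Set<Integer> join(int p1, int o1, int p2, int o2) {
        Set<Integer> result = subjects(p1, o1);
        result.retainAll(subjects(p2, o2));
        return result;
    }

    public static void main(String[] args) {
        RdfTensorSketch t = new RdfTensorSketch();    // IDs 0..3 stand for dictionary-encoded terms
        t.add(0, 1, 2);
        t.add(0, 3, 2);
        System.out.println(t.join(1, 2, 3, 2));       // [0]
    }
}
```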

Fact Checking

Problem: Data might not be reliable
Solution: Automatic fact checking
Goal: Integrate existing solutions into one single framework/demo

Semantic Web applications typically rely on facts formulated as triples. Every available triple is treated as a true fact. However, not all of the triples gathered by our crawler might be true. Some facts might be wrong, e.g., because of human errors. Several systems for checking facts exist. Some of them rely on textual evidence (e.g., FactCheck) while others use knowledge bases to confirm or refute a given fact (e.g., COPAAL). The figure below shows such a fact (the dotted arrow) and other facts which support it. These systems should be integrated into a single, microservice-based framework. Additionally, the COPAAL demo should be extended so that it can work with all of the integrated fact checking systems.
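
One way such an integration could look is sketched below as a hypothetical common Java interface that every fact-checking system (e.g., FactCheck or COPAAL) would implement behind its own microservice adapter. The interface name and the veracity score in [0, 1] are assumptions for illustration, not the actual design of the planned framework.

```java
import java.util.List;

/**
 * Hypothetical common interface for plugging fact-checking systems into one
 * framework. Each system would run as its own microservice behind an adapter
 * implementing this interface; all names and the score semantics are
 * illustrative assumptions only.
 */
public class FactCheckingFrameworkSketch {

    record Fact(String subject, String predicate, String object) {}

    interface FactChecker {
        String name();

        /** Returns a veracity score in [0, 1] for the given fact. */
        double check(Fact fact);
    }

    /** Runs every registered checker on the same fact and reports the scores. */
    static void checkWithAll(List<FactChecker> checkers, Fact fact) {
        for (FactChecker checker : checkers) {
            System.out.printf("%s: %.2f%n", checker.name(), checker.check(fact));
        }
    }

    public static void main(String[] args) {
        // Dummy checker standing in for a real adapter that would call a remote service.
        FactChecker dummy = new FactChecker() {
            public String name() { return "dummy"; }
            public double check(Fact fact) { return 0.5; }
        };
        checkWithAll(List.of(dummy),
                new Fact("dbr:Barack_Obama", "dbo:spouse", "dbr:Michelle_Obama"));
    }
}
```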

Intelligent Data Assistant

Problem: Poor question answering (QA) and conversational solutions for accessing the available data
Solution: Intelligent data-driven assistant
Goal: Enable intelligent assistant to address more complex requests

Although a lot of data is available, access to it is mostly restricted to data analysts who can make use of special software and know the complex algorithms necessary to present the data in an understandable way. The goal of this subgroup is to give normal users access to these algorithms by means of an intelligent assistant which guides the user through the process.

Benchmarking

Problem: Pipelines need to be built in a data-driven way
Solution: Automatic benchmarking
Goal: Improve data-driven decision making

The performance of our approaches needs to be measured. Therefore, we developed the benchmarking platform GERBIL. While it started as a benchmarking platform for knowledge extraction approaches, the project has since branched out into several other directions, namely question answering and fact checking. The goal of this subgroup is to develop GERBIL 2.0, a more general benchmarking framework which aligns the efforts of the different existing GERBIL branches. This goal will be reached by classical software development and engineering processes: starting from a list of requirements and a list of available functionalities, developing concepts in UML, and then diving into the coding itself following test-driven development.

Evaluation

The goal of a student joining the project group should be:

Be a valuable member of your team!

This mainly includes (but might not be limited to) the following aspects which are considered for evaluating the students’ performance:

  1. Performance/Code Quality: The students will choose different tasks that they tackle within their respective subgroup. The performance of the students regarding these tasks is measured. Typically, the tasks result in a piece of code. However, the tasks are not limited to that, and other results (e.g., diagrams, concepts, etc.) are taken into account as well. These results, and especially the code of each student, will be reviewed with regard to a) functionality, b) quality, and c) documentation.
  2. Communication/Presentation: The communication of the students within the group is of central importance; successful teamwork is not possible without it. Therefore, the communication of the students with their team members as well as with their supervisor (during regular meetings, via mail, etc.) will be evaluated. Additionally, the subgroups will present their respective work on a presentation day with allocated time slots once per semester. The presentations will be evaluated and attendance at the presentation day is mandatory.
  3. Management/Report: Although the supervisor will give hints regarding the students' work, the students are mainly responsible for managing the project. Therefore, the students will select a project leader (or group speaker) who will be the main contact point for the supervisor. This leader should make sure that the group follows an agile management process (note that this responsibility does not mean that this student has to do everything by him/herself; tasks can be delegated). At the end of each semester, each student will submit a short report on the current state of their work (~4 pages).