Statistical Natural Language Processing

Lecture (Master)

Content

Humanity generates exabytes of data every year. Most of this data is available in some rendition of natural language (in particular text). Hence, the inclusion of textual data sources is of growing importance in large-scale data-driven applications. A popular application scenario is that of personal assistants (Siri, Google Home, Cortana, etc.), which rely partly on Web pages to extract or select answers to user questions. Processing large amounts of text in a semantically sound manner, however, turns out to be rather difficult for machines. The goal of this lecture is to provide students with insights into approaches based mostly on probabilistic models, which aim to facilitate the implementation of pipelines for processing natural language text. The lecture is structured as follows:

  1. Finite-state automata
  2. Language models
  3. Spell checkers
  4. Deduplication
  5. Classification
  6. Hidden Markov Models
  7. Grammar and semantics
  8. Parsing natural language
  9. Word Sense Disambiguation
  10. Distributional semantics

Structure

The course consists of:

  1. A lecture
    2h/week, slides uploaded after the lecture
  2. Six series of coding exercises
    Evaluated automatically through an online platform. Students are required to reach at least 50% of the points and submit at least 60% of the exercises to be allowed to participate in the exam. The exercises are discussed during a bi-weekly seminar.
  3. A mini-project
    The goal of the mini-project is to apply the content of the lecture to a practical problem and to implement a non-trivial solution to this problem. Groups of up to 3 persons are allowed, as long as the portion of the work carried out by each student can be identified clearly. The solution is evaluated automatically on a benchmark against a non-trivial baseline solution to the same problem. Students must outperform the baseline to be allowed to participate in the exam. Moreover, a short document (12-15 pages, written using the provided LaTeX template) explaining the implemented solution and a link to clearly commented code are prerequisites for completing this requirement for the exam.

Exam

The exam lasts 90 minutes. The students are expected to answer both theoretical questions (e.g., what are the time and space complexities of a particular algorithm) and practical questions (e.g., write a regular expression to extract all occurrences of “mouse” from a piece of text).
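
For illustration, a minimal sketch of how such a practical question could be answered in Python; the example text and the use of the `re` module are assumptions made for this illustration, not part of the exam.

```python
import re

# Hypothetical exam-style task: extract all occurrences of the word
# "mouse" from a piece of text (the text below is made up).
text = "A mouse ran past the computer mouse, but the mice ignored it."

# \b marks word boundaries, so only the whole word "mouse" matches,
# not substrings of longer words.
matches = re.findall(r"\bmouse\b", text)

print(matches)  # ['mouse', 'mouse']
```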

Course in PAUL

L.079.05702 Statistical Natural Language Processing (in English)

Discussion forum in PANDA

go.upb.de/snlp (or PANDA link)

Contact

Michael Röder