The DICE Group has been actively involved in the development and application of Large Language Models (LLMs) across various fields. Following the successful publication of our first massively multilingual LLM, LOLA (https://aclanthology.org/2025.coling-main.428/), we are now aiming to scale our research to cover even more languages and modalities.
We have also completed one full iteration of this Project Group, which started in Summer Semester 2025. The outcomes of that iteration (datasets, code, experiments, and lessons learned) will be directly available to the incoming students, giving them a solid foundation of knowledge to build upon.
For a detailed overview of the previous project group, see their conclusion slides: Final_Presentation_HTYLLM_PG_SoSe_25.pdf.
With this background, the current PG offers a unique opportunity to collaborate on developing the next generation of multilingual and multimodal language models. The project will push the boundaries of current LLM capabilities while providing hands-on experience with cutting-edge Natural Language Processing (NLP) and Machine Learning (ML) techniques.
Our project group aims to train a large, open-source multilingual language model and address the challenges posed by the curse of multilinguality. Specifically, our goals include:
For more information, check out the slides: HTYLLM2_PG_SoSe_26.pdf.
Q: What is the selection process for this project?
A: Candidates will need to submit an assignment and undergo an interview as part of the selection process.
Q: Is there a seminar connected to this PG?
A: No.
Q: What are the prerequisites for this PG?
A: The ideal candidate should possess foundational knowledge of NLP and ML, along with strong programming skills in Python and shell scripting. Proficiency in Linux is also essential. The ability to learn quickly and adapt to new technologies and methodologies is equally important, as the PG domain is expected to have a steep learning curve.
In case you have further questions, feel free to contact Nikit Srivastava.
Coming soon