Large Language Models
and the Estonian Language

State of the Art

Maciej Eder

2025-11-19

DigiTS

Digital Text Scholarship | Digitekstide uurimiskeskus

  • hosted by the Institute of Estonian and General Linguistics
  • Horizon Europe ERA Chair grant (#101186601)
  • supported by the European Union with €2.5m
  • 5 years (March 2025 – February 2030)

objectives

  • Objective 1: To establish an excellent international research team for DigiTS.
  • Objective 2: To conduct excellent internationally visible research in text-based DH in collaboration with related institutes of UT.
  • Objective 3: To improve the quality and diversity of DH teaching and support at UT, training future DH scholars, helping collaborating units in implementing computational methods for text analysis and educating future GLAM professionals.
  • Objective 4: To contribute to the development and management of the text-based data infrastructure for Estonian LLMs. 👈
  • Objective 5: To ensure sustainability of DigiTS results.

the team

ERA Chair professor

  • ERA Chair and Visiting Professor in Digital Humanities (Institute of Estonian and General Linguistics, UT)
  • Director and Professor of Linguistics, Institute of Polish Language (Polish Academy of Sciences)
  • Professor of Literature, Pedagogical University of Kraków
  • expert in computer-assisted text analysis using machine learning and neural networks
  • PhD in literature, habilitation in linguistics
  • PI of the Horizon 2020 project Computational Literary Studies Infrastructure (CLS INFRA)
  • author of the software package stylo for R
  • Tartu Linnamaraton finisher (21k)

project management

  • Liina Lindström, Professor of Modern Estonian (PI)
  • Maciej Eder, ERA Chair and Visiting Professor of Digital Humanities
  • Joshua Wilbur, Lecturer in Digital Linguistics
  • Loone Vilumaa, Project Manager

research team

  • Maciej Eder, research group leader
  • Kristiina Vaik, Research Fellow in Digital Humanities
  • Botond Szemes, Research Fellow in Digital Humanities
  • Thiago Dumont Oliveira, Research Fellow in Digital Humanities
  • Bhumika Bhattacharya, Junior Researcher
  • Sofia Kriuchkova, Junior Researcher
  • student assistants (to be appointed)

international advisory board

  • Prof. Karina van Dalen-Oskam, Huygens Institute (KNAW) and University of Amsterdam
  • Prof. Christof Schöch, University of Trier
  • Prof. Ray Siemens, University of Victoria
  • Assoc. Prof. Nina Tahmasebi, University of Gothenburg
  • Dr. Artjoms Šeļa, Czech Academy of Sciences

a selection of DigiTS work packages

  • Achieving Excellence in Text-Based Digital Humanities
  • Digital Humanities Education and Support
  • Infrastructure Creation and Stakeholder Engagement
  • Dissemination and Communication

how to find us

Welcome to the LLM Extravaganza!

What’s an LLM?

  • Large
  • Language
  • Models

Think of it as a super-smart robot that understands and generates human language. But instead of being a Terminator, it’s more like a friendly chatbot with a PhD in linguistics.

Fun Fact: LLMs can write poetry, solve math problems, and even tell jokes (though sometimes the jokes are as bad as mine).

Why Should You Care?

  • Linguists: Ever wanted a tool that can analyze vast amounts of text in seconds? LLMs have got you covered.
  • Computer Scientists: Imagine training models that can understand context and generate coherent responses. Mind-blowing, right?

Fun Fact: LLMs can help you write better code comments (or at least try to).

The Magic Behind the Curtain

  • Training: Feeding LLMs with tons of text data.
  • Fine-Tuning: Teaching them specific tasks like translation or summarization.
  • Inference: Making predictions based on what they’ve learned.

Fun Fact: Training an LLM is like teaching a toddler to talk, but the toddler has read every book in the world.
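
The three stages above can be tried out on a toy scale. The sketch below is only an illustration: it swaps the neural network of a real LLM for a simple word-bigram counter, and the function names `train` and `predict_next`, as well as the sample sentences, are made up for this example. "Training" counts which word follows which, "fine-tuning" continues counting on task-specific text, and "inference" returns the most frequent follower.

```python
from collections import Counter, defaultdict


def train(counts, text):
    """Update bigram counts in place from whitespace-tokenised text.

    Toy stand-in for training: a real LLM adjusts neural network weights
    by gradient descent instead of counting word pairs.
    """
    tokens = text.lower().split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Inference: return the most frequent follower of `word`, or None."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Training: learn from general-purpose text (hypothetical sample).
model = defaultdict(Counter)
train(model, "the cat sat on the mat . the cat ate the fish .")
print(predict_next(model, "the"))  # cat

# Fine-tuning: keep training the same model on task-specific text.
train(model, "the model reads the corpus . the model writes the summary . the model learns .")
print(predict_next(model, "the"))  # model
```

After the second call to `train`, the prediction for "the" shifts from "cat" to "model", which is the point of fine-tuning: the same model, nudged towards a new domain by additional data.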

What to Expect Today

  • In-depth Talks: On various applications of LLMs.
    • Eleri Aedmaa & Kristel Uiboaed (EKI), Advancing Language Resources: Infrastructure, Data Collection at Scale, and Benchmarking Strategy.
    • Kairit Sirts (UT), Improving Estonian Language Capabilities in Open LLMs: Opportunities and Challenges.
    • Mark Fišel (UT), Large Models, Small Data.
  • Q&A Session: To clarify any doubts you might have.
  • Round Table Discussion: Meet fellow enthusiasts and experts.

Fun Fact: You might even learn how to teach an LLM to write a conference paper (just kidding… or am I?).

authorship note

  • the 6 final slides were produced by a model
    • mistral-small 20b
    • run locally on a laptop
    • 24 GB of RAM was sufficient to run it
    • accessed through the ollama interface
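
For anyone who wants to try a similar setup, a minimal sketch of querying a locally running ollama server from Python follows. It assumes the server listens on ollama's default address (`http://localhost:11434`) and that a model has already been pulled under the tag `mistral-small`; both are assumptions about a local installation, not a record of the exact configuration used for these slides. The helper names `build_request` and `generate` and the sample prompt are hypothetical.

```python
import json
import urllib.request

# Default endpoint of a local ollama server (assumed, adjust if needed).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="mistral-small"):
    """Build a non-streaming generation request for the ollama REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="mistral-small"):
    """Send the prompt to the local model and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Requires a running ollama server with the model pulled beforehand:
# print(generate("Write a light-hearted slide introducing large language models."))
```

Setting `"stream": False` makes the server return one JSON object with the full answer in its `response` field, which keeps the client code short; streaming would return the answer piece by piece instead.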