Digital Humanities and AI:
A Showcase

Maciej Eder

University of Tartu | Polish Academy of Sciences

2026-05-11

not only chatGPT

chatbot

another chatbot

traditional models not sufficient

multivariate machine learning (e.g. SVM) fails to map a large feature space:

turning points

1970s – artificial neural networks
2000s – deep learning networks
2013 – word embeddings (word2vec)
2014 – sequence mapping (seq2seq)
2017 – attention is all you need (The Transformer) 👈
2018 – pre-trained multi-language model (BERT)
2019 – large generative model (GPT-2)
2022 – a model released as a chatbot (chatGPT)
2023 – multimodal models
2024 – reasoning models
2025 – compressed models (distillation, quantization)
2026 – agentic systems (OpenClaw, Hermes-Agent)

the Transformer

artificial neural network

deep learning neural network

encoder-decoder neural network

transformer neural network

sequence to sequence mapping

from audio to text

from text to audio

from text to image

Universe, LSD, Fractal Worlds, Eyes

the same prompt, different results

machine translation

Large Language Models

LLM, LRM, GPT, BERT

BERT: Bidirectional Encoder Representations from Transformers
GPT: Generative Pre-trained Transformer
- not only the architecture
- but also the trained parameters
- massive amounts of textual data used
LLM: Large Language Model
- a group of modern text-oriented models
LRM: Large Reasoning Model
- a group of models involving “chain-of-thought” reasoning

model size comparison

a small selection of LLMs

Model	Release	Parameter Count	Training Data
Opus 4.7	2026	>1.6 trillion	almost the whole internet
GPT‑4	2023	>1 trillion	web + proprietary
DeepSeek-R1	2025	671B	>85,000 agent tasks
LLaMA 2	2023	7B, 13B, 70B	public corpora
Mistral 7B	2023	7B	public data
BERT	2018	110M	Wikipedia + BookCorpus

😲 recent LLMs are over 10,000 times bigger than BERT

translation to sign language

aim of the project

to provide a system to automatically translate
- written language to sign language
- originally from Polish to PSL (PJM)
- later: Ukrainian to USL
capable of translating any input text
prototype: public administration domain
plans: the system used by public institutions

avatar modeling with Unreal

Kristofer the Avatar

mock-up gesture capturing

sign language represented as glosses

A sentence in a phonic language:

I will put a book on a table.

Vs. a sentence in a sign language:

BOOK, TABLE, PUT-ON

Not only the word order differs (SVO vs. SOV), but also the number of words.

(Also, sign languages use space, and have non-manual gestures…)

translation as a mapping problem

the Transformer again

Deep learning neural network
Based on the multi-head attention mechanism
Context-aware
Suitable for language data (i.e. linear order matters)
Designed to solve machine translation problems (!)
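
How attention makes the model context-aware can be shown in a few lines. This is a minimal sketch of single-head scaled dot-product attention in plain Python, a simplification of the multi-head mechanism named above; the function names and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mix of the values."""
    d_k = len(keys[0])
    output = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors, one dimension at a time
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        output.append(mixed)
    return output
```

Each output vector is a blend of all value vectors, weighted by how well the query matches each key; this is what lets every word "look at" its context.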

data scarcity problem

AI models have to be fed with lots of data
In our case, only 800 sentences available
Two approaches to overcome the issue:
- Data Augmentation (synthetic datasets to the rescue)
- Transfer Learning (train the model, and then fine-tune)

800 manually annotated sentences

Deutsche Gebärdensprache (DGS)

overcoming data limitations

Transfer Learning:

training a model on a big yet general dataset
- e.g., on 80,000 sentences from the DGS corpus
fine-tuning using a target dataset
- e.g. the 800 sentences in PJM

Data Augmentation:

creating a synthetic dataset
by artificially copying original sentences…
… with some random modifications introduced.
finally, training a model on the augmented dataset
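
The augmentation steps above can be sketched as follows. The slides do not say which random modifications were used, so this sketch assumes simple word swaps applied consistently to the sentence and to its glosses; the substitution table, the function name and the example pair are hypothetical.

```python
import random

# Hypothetical substitution table (an assumption, not the project's data):
# words that may replace each other in a sentence and in its glosses.
SUBSTITUTIONS = {
    "book": ["letter", "form"],
    "table": ["desk", "shelf"],
}

def augment(sentence, glosses, n_copies, seed=0):
    """Return n_copies randomly modified copies of a (sentence, glosses) pair."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        new_sentence = list(sentence)
        new_glosses = list(glosses)
        for word, options in SUBSTITUTIONS.items():
            swappable = word in new_sentence and word.upper() in new_glosses
            if swappable and rng.random() < 0.5:
                replacement = rng.choice(options)
                # apply the same swap on both sides so the pair stays aligned
                new_sentence[new_sentence.index(word)] = replacement
                new_glosses[new_glosses.index(word.upper())] = replacement.upper()
        copies.append((new_sentence, new_glosses))
    return copies
```

Calling `augment("i will put a book on a table".split(), ["BOOK", "TABLE", "PUT-ON"], 10)` yields ten pairs, some of them identical to the original and some with one or both nouns replaced.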

does it work?

printing centers
in the 16th-19th centuries

aim of the study

to map the printing market in the 16th-19th centuries
to corroborate a centralization (?) of publishing houses
to observe the shift (?) from Krakow to Vilnius, and then to Warsaw

the dataset

there exists a comprehensive bibliography of all the prints related in any way to Poland (i.e. by language, place of publication, or content)
compiled by Karol Estreicher at the turn of the 19th century
completed by his son, and then by his grandson
ca. 250,000 books recorded for the 16th-18th centuries
ca. 140,000 books recorded for the 19th century
available in print, available as a database (?)

the bibliography: a sample title page

the bibliography: a sample page

a database exists, but…

unstructured data in the wild

Cracoviae ex Off. Hier. Szarffenbergii. A. 1549.
Typis Univ. Zamoscensis. A. 1748.
w Łowiczu 1782.
S. Pietierburg, w tip. Wtorago Otdielenija Sobstwiennoj Jego Imp. Wieliczestwa Kancelarii, 1849,
Frankfurt und Leipzig 1728.
W drukarni Lwowskiey Soc. Jesu {b. r. 1746}.
Lemberg, 1888,
V Praze, tisk a sklad c. k. knihtiskárny Synů Bohumila Haase, 1852,
Bromberg, Louis Levit, gedruckt bei C. L. Gasse, 1844,
Dantisci 1644.
München, Druck, Franz Paul Ercacher, (ok. 1895),
1565.
Vindobonae, typ. Ueberreiter, 1840,
Lwiw, tszczanijem, iżdywenijem i typom Instytuta Stauropyhijanskaho pry Cerkwi Usp. Pr. Bohorod., 1857,
Posiedzeń 10,
Danzig, Verlag von Th. Bertling, Druck von A. W. Kafemann, 1860,
Anno M.DC.LXXXV. (1685). Crac: Typis Francisci Cezary, S. R. M. Typ.
Gedruckt zu Leiptzig, M. D. LXXVI (1576),
Dorpat, bei C. A. Kluge; Leipzig bei C. F. Köhler, gedruckt bei J. C. Schünmann in Dorpat, 1836,
Gedruckt zu Dantzigk, durch Jacobum Rhodum. M. D L XXX (1580),
Gdańsk, druk. wdowy Jerzego Rhete, 1649.
Typis Academiae Posnaniensis (1698).
In Venegia appresso Gabriel Giolito de Ferrari MDLXI (1561).

LLMs to the rescue

manual data extraction unrealistic for ~400,000 entries
LLMs capable of classifying, denoising, translating etc.
LLMs good at detecting patterns in unstructured data
however, the whole dataset cannot simply be fed into an LLM at once
but: the data can be split into batches
batches of 50 entries sent to the model, one at a time
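
The batching step can be sketched in a few lines of Python. The helper names are hypothetical, and numbering the entries inside the prompt is an assumption made here so that the model's numbered output can be matched back to the input.

```python
def make_batches(entries, batch_size=50):
    """Split a list of entries into consecutive batches of batch_size."""
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]

def build_prompt(intro, batch):
    """Append the numbered entries of one batch to the prompt introduction."""
    numbered = [f"{i}. {entry}" for i, entry in enumerate(batch, start=1)]
    return intro + "\n".join(numbered) + "\n"
```

The last batch is simply shorter when the number of entries is not a multiple of 50.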

the prompt

PROMPT_INTRO = """You are an expert librarian, with a profound expertise in Polish prints 
from 16th-19th centuries.

I will give you bibliographic entries divided into three TAB-separated fields:
Author[TAB]Title[TAB]Publication info.

Extract ONLY place of publication and year of publication.
Prioritize place/year inside parentheses if present.
Convert city names to modern Polish spelling when possible (e.g., Breslau/Vratislavia -> Wrocław; 
Lemberg -> Lwów).
If missing, output "-".

OUTPUT RULES:
- Output EXACTLY one line per entry, numbered 1..N.
- Each line must be: "i. Place: <Place>, Year: <Year>"
- No extra text.

Entries:
"""

the system works nicely

Place: Mülheim a. d. R., Year: 1871-1876  
Place: Warszawa, Year: 1881  
Place: Warszawa, Year: 1881  
Place: Toruń, Year: 1882  
Place: -, Year: -  
Place: Warszawa, Year: 1895  
Place: -, Year: -  
Place: Kraków, Year: 1892  
Place: -, Year: -  
Place: -, Year: -  
Place: Kraków, Year: 1874  
Place: Lwów, Year: 1848  
Place: -, Year: 1870  
Place: -, Year: -  
Place: Warszawa, Year: 1831  
Place: Lwów, Year: 1848  
Place: Kraków, Year: 1848  
Place: Lwów, Year: 1848  
Place: Warszawa, Year: -  
Place: -, Year: -  
Place: Lwów, Year: 1848  
Place: Paryż, Year: 1843  
Place: Kraków, Year: 1900
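
Replies in this shape are easy to turn into structured records. A minimal parsing sketch, assuming the line format requested in the prompt with an optional leading number; the pattern and the function name are illustrative, and "-" is mapped to `None`.

```python
import re

# Matches e.g. "Place: Lwów, Year: 1848" with or without a leading "12. "
LINE_PATTERN = re.compile(r"^\s*(?:\d+\.\s*)?Place:\s*(.*?),\s*Year:\s*(.*?)\s*$")

def parse_reply(reply):
    """Extract (place, year) pairs from a model reply, skipping other lines."""
    records = []
    for line in reply.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            place, year = match.groups()
            records.append((None if place == "-" else place,
                            None if year == "-" else year))
    return records
```

Years are kept as strings on purpose, since ranges such as 1871-1876 occur in the output.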

sometimes, LLMs seem confused

The given input describes a large set of scenarios for the zombie apocalypse simulation. For each 
of the 12 test cases, a valid path from the upper-left corner to the lower-right corner of the 
50 × 50 grid is not found while avoiding both the obstacles and the attack ranges of the zombies. 
Consequently, the output for every test case is a single line containing `-1`.

-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1

Place: -, Year: -  
Place: -, Year: -  
Place: -, Year: -  
Place: -, Year: -  
Place: -, Year: -

some replies are weird…

#### Correctness Proof  

We prove that the algorithm outputs the correct value for every query.

---

##### Lemma 1  
During the processing of a query the variable `index` equals the
binary number whose bits are exactly the bits encoded by the query
(`1` → `1`, `2` → `0`), read from the first to the last integer.

**Proof.**

*Initialization.*  
Before the first integer is processed `index = 0`.  
This is the value of a binary number with no bits – the empty prefix.

*Induction step.*  
Assume after reading the first `k` integers (`k ≥ 0`)
`index` equals the integer represented by the first `k` bits.
When the `(k+1)`‑st integer `d` is read,
the algorithm shifts the current value left by one (`index << 1`)
and OR‑s with the new bit `bit` (`0` if `d = 2`, `1` if `d = 1`).
Thus the new value represents the binary number whose prefix
consists of the first `k` bits followed by the `(k+1)`‑st bit.
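
Replies like the two above can be caught automatically before they reach the results. The slides do not say how this was done; the following is one possible guard, assuming a reply is acceptable only if it has exactly the expected number of non-empty lines and every line follows the requested format. The function name is hypothetical.

```python
import re

# A well-formed line: optional "12. " prefix, then "Place: ..., Year: ..."
WELL_FORMED = re.compile(r"^\s*(?:\d+\.\s*)?Place:\s*.*?,\s*Year:\s*.*$")

def is_valid_reply(reply, expected=50):
    """True if the reply has `expected` lines and all of them are well-formed."""
    lines = [line for line in reply.splitlines() if line.strip()]
    return len(lines) == expected and all(WELL_FORMED.match(line) for line in lines)
```

A batch whose reply fails the check can be sent to the model again or set aside for manual inspection.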

the eureka moment

PROMPT_INTRO = """You are an expert librarian, with a profound expertise in Polish prints 
from 16th-19th centuries.

I will give you bibliographic entries divided into three TAB-separated fields:
Author[TAB]Title[TAB]Publication info.

Extract ONLY place of publication and year of publication.
Prioritize place/year inside parentheses if present.
Convert city names to modern Polish spelling when possible (e.g., Breslau/Vratislavia -> Wrocław; 
Lemberg -> Lwów).
If missing, output "-".
Take the entries one by one carefully to avoid confusion.
Expect to process exactly 50 entries.

OUTPUT RULES:
- Output EXACTLY one line per entry, numbered 1..N.
- Each line must be: "i. Place: <Place>, Year: <Year>"
- No extra text.

Entries:
"""

results

results: number of printed books

results: centers vs. peripheries

results: printing centers

Small & tidy vs. Big & dirty

it would have been great to have it all, but…
messy dataset better than no dataset
in the Humanities, we are obsessed with high quality
however, a bigger picture only possible when the dataset is big too!

but…

the model gpt-oss:120b outperformed the competition (so far)
on April 2, 2026: the Gemma4 family released:
- Gemma 4 26b
- Gemma 4 31b
- Gemma 4 e4b
- Gemma 4 e2b
on April 3, 2026: the Qwen3.6 family released:
- Qwen 3.6 27b
- Qwen 3.6 35b

evaluation

benchmark design

1,000 entries picked at random
annotated manually by 2 people
presented to a number of models
- the same prompt
- the same batch size
- the same order of batches
compared with manual annotation
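
The comparison with manual annotation can be expressed as a simple score. This sketch assumes the score is the share of entries where the model's answer exactly matches the annotators' answer; the slides do not name the metric, so treat this as one plausible reading. The function name and the sample values are illustrative.

```python
def accuracy(predicted, gold):
    """Share of positions where the prediction equals the manual annotation."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty benchmark")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)
```

Place and year would be scored separately, e.g. `accuracy(predicted_places, gold_places)` and `accuracy(predicted_years, gold_years)`.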

performance comparison

	Place	Year		laptop	local	cloud
gpt-oss:20b	0.890	0.856		✅	✅	✅
gemma4:26b	0.840	0.854		✅	✅	❌
qwen3.6:35b	0.940	0.959	🏆	🐌	✅	❌
gemma4:31b	0.929	0.951	👀	🐌🐌	🐌	✅
qwen3.6:27b	0.945	0.953	💪	🐌🐌	🐌	❌
deepseek:70b	0.646	0.422		❌	✅	✅
gpt-oss:120b	0.924	0.713		❌	✅	✅
gpt5.4	???	???	🥇	❌	❌	✅

LLMs run locally

easier than you think

get a program to run LLMs, e.g.: https://ollama.com/
pull a model of your choice, e.g.:
- gemma3n – multilingual, designed to work on older laptops
- deepseek-r1:8b – reasoning model, yet still relatively small
- qwen3.5:9b – a new kid on the block, reasoning while compact
- mistral-small – bulky (13 GB on disk), but still can be run locally
- gpt-oss:20b – might require a recent laptop, but still installable
- cogito:70b – a monster desktop computer should be able to handle it
- . . . and dozens of other models.
run in a chatbot mode (known from chatGPT etc.)
or process your dataset in batches.
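
Batch processing with a local model needs only the Python standard library. A minimal sketch, assuming Ollama is running on its default local port and the model has already been pulled (e.g. with `ollama pull gpt-oss:20b`); the helper names and the timeout are illustrative choices.

```python
import json
import urllib.request

# Default endpoint of a locally running Ollama server
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt):
    """Prepare a non-streaming generation request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(model, prompt, timeout=600):
    """Send one prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the server running, `ask("gpt-oss:20b", prompt)` returns the reply for one batch as a single string.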
