Corpora

The following selection of links is only the tip of the iceberg when it comes to corpora (text collections) suitable for text analysis. The corpora listed below, however, were compiled by members of CSG and checked for compatibility with commonly used stylometric software.

If you’re looking to test the methods on poetry or drama, we are also happy to recommend the following projects run by friends of the group:

  • PoeTree - “a standardized collection of poetry corpora comprising over 330,000 poems in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian, Spanish). Each corpus has been deduplicated, enriched with Universal Dependencies, provided with additional metadata and converted into a unified JSON structure.” PoeTree is a project headed by our colleagues from the Czech Academy of Sciences and is continuously growing.
  • DraCor - “(short for »drama corpora«) is an open digital infrastructure developed for the computational study of (mostly) European drama from Greco-Roman antiquity to the 20th century.” The project currently includes 21 corpora of dramatic works, and counting, with rich encoding that facilitates analysis of drama at various levels. As with PoeTree, the corpora are continuously growing.

Documentation of the package ‘stylo’

  • for (real) beginners: a crash course in the form of a slideshow
  • for (sort of) beginners: a concise HOWTO
  • for advanced users: a paper in The R Journal
  • full documentation on CRAN

Blog posts on non-obvious functions of the package ‘stylo’:

Method discussions:

Video introductions

Publications

A list of relevant publications by CSG members can be found on this website, on the subpage ‘publications’. In addition, the comprehensive Stylometry Bibliography, curated by Christof Schöch, is the place to consult before starting any experiment in text analysis.

Learn with us

The members of the group regularly conduct invited workshops around the world, including yearly courses at the Digital Humanities Summer Institute (DHSI) in Victoria, BC, and the European Summer University in Digital Humanities (ESUDH) in Leipzig. Below we aim to list upcoming and past events:

2025 major workshops

2024 major workshops

2023 major workshops

2022 major workshops

  • 2–12 August European Summer University in Digital Humanities in Leipzig, Germany. Taught by Maciej Eder and Jeremi K. Ochab.
  • 22–24 March COST Action Winter School in Belgrade. Taught (remotely) by Joanna Byszuk, Artjoms Šeļa and Maciej Eder.

2021 major workshops

2020 major workshops

2019 major workshops