This is a 5-year research project funded by the National Science Centre (SONATA-BIS 2017/26/E/HS2/01019), carried out at the Institute of Polish Language (Polish Academy of Sciences). It aims to develop, evaluate and apply an innovative methodology for comparing large text collections, in order to identify hidden patterns and similarities invisible to the naked eye. The official title of the project reads: “Wielkoskalowa analiza tekstu i metodologiczne podstawy stylistyki komputerowej”, which can be translated into English as: “Large-Scale Text Analysis and Methodological Foundations of Computational Stylistics”. Since the title is somewhat long, we dropped its first part; hence the handy version “Foundations of Computational Stylistics” (FoCS).
The project will involve a team of 5-6 people, and we are getting there quickly. This is us so far:
- Maciej Eder (PI)
- Joanna Byszuk (PhD candidate)
- Artjoms Šela (post-doc)
- Albert Leśniak (PhD candidate)
The assumed impact of the project is:
- a novel outlook on language and literary creation;
- a contribution to classification techniques in the domain of textual data and in large-scale tasks, including a reexamination of traditional periodizations and classifications of literature;
- a contribution to visualization techniques;
- a potential contribution to any task related to text analysis: spam filtering, sentiment analysis, plagiarism detection, authorship profiling, and information retrieval.
The project is divided into five related subprojects. Subproject A will explore theoretical problems of stylometric tests, including reliability issues and visualization techniques; subproject B will test the established methodology on large corpora of literary texts that differ in language, genre, theme and authorial gender, both in the original and in translation; subproject C will examine grammatical features as indicators of stylistic variation; subproject D will discuss style differentiation across modes, genres, and themes within one particular literary tradition, namely Latin; and, finally, subproject E will explore the usefulness of sequential analysis methods for examining local stylistic changes in texts.
First stages - 2019 update
Since the project launch in October 2018, we have investigated two methodological issues:
- selecting features in classification for authorship attribution,
- reliability of various methods of hierarchical and network clustering.
In the first task we examined methods of selecting and ordering features, a problem that has been studied relatively rarely, and then mostly with a focus on the first part of the question (selection). We tested various ways of selecting, weighting and ordering features, eventually proposing a novel solution (to be revealed soon!) which provides a small improvement over existing ones.
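To make the notion of selecting and ordering features more concrete, here is a minimal sketch of the conventional baseline: a wordlist ordered by descending corpus frequency, optionally filtered by the share of texts in which a word occurs. This is not the novel solution mentioned above. The function names, the whitespace tokenizer, the `min_doc_share` parameter and the toy texts are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def ordered_wordlist(texts, min_doc_share=0.0):
    """Return words ordered by descending corpus frequency.

    Words occurring in fewer than `min_doc_share` of the texts are dropped.
    Ties in frequency are broken alphabetically to keep the order stable.
    """
    tokens_per_text = [tokenize(t) for t in texts]
    # Total occurrences of each word across the whole corpus.
    corpus_counts = Counter(tok for toks in tokens_per_text for tok in toks)
    # Number of texts in which each word occurs at least once.
    doc_counts = Counter(tok for toks in tokens_per_text for tok in set(toks))
    n_texts = len(texts)
    kept = [w for w in corpus_counts if doc_counts[w] / n_texts >= min_doc_share]
    return sorted(kept, key=lambda w: (-corpus_counts[w], w))


# Toy corpus, invented for this example.
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

wordlist = ordered_wordlist(texts)                     # all words, by frequency
culled = ordered_wordlist(texts, min_doc_share=0.6)    # words in at least 2 of 3 texts
```

Taking the first n items of such a list is the usual way of picking features for attribution; the open question discussed above is whether a different ordering would serve classification better.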
In the second task we cooperated with two great colleagues, Jeremi K. Ochab from Jagiellonian University and Steffen Pielström from the University of Würzburg. Our main question was: can we determine whether hierarchical or network clustering produces more reliable results in stylometry? And which conditions influence that? We examined the problem using a number of benchmark corpora in a few languages, testing for the correlation between the stability of results and a) linkage method, b) clustering algorithm, c) distance measure, d) number of features.
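One simple way to picture a stability check of this kind is to cluster the same texts under two linkage methods and measure how far the resulting groupings agree. The sketch below is a toy version under loud assumptions: it is not the procedure used in the study, it covers only the hierarchical side (no community detection), and the Manhattan distance, the Rand index as agreement score, the function names and the frequency vectors are all chosen here for illustration.

```python
from itertools import combinations


def manhattan(u, v):
    """Manhattan (city-block) distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def agglomerate(vectors, k, linkage=max):
    """Merge clusters bottom-up until k remain.

    `linkage` aggregates the pairwise distances between two clusters:
    min gives single linkage, max gives complete linkage.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(manhattan(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: b > a, so index a is unaffected
    return clusters


def rand_index(c1, c2, n):
    """Share of item pairs on which two partitions agree."""
    def labels(clusters):
        return {m: idx for idx, members in enumerate(clusters) for m in members}
    l1, l2 = labels(c1), labels(c2)
    pairs = list(combinations(range(n), 2))
    agree = sum((l1[i] == l1[j]) == (l2[i] == l2[j]) for i, j in pairs)
    return agree / len(pairs)


# Invented relative frequencies of two words in five texts.
vectors = [(0.10, 0.05), (0.11, 0.05), (0.30, 0.20), (0.31, 0.21), (0.32, 0.19)]

single = agglomerate(vectors, 2, linkage=min)
complete = agglomerate(vectors, 2, linkage=max)
agreement = rand_index(single, complete, len(vectors))
```

On this deliberately well-separated toy data both linkages recover the same two groups. A real test would repeat the comparison over many corpora, distance measures and feature counts, and look at how the agreement score varies.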
We are very happy to share that papers covering the results of these studies were accepted as long paper presentations for the DH 2019 conference! Look for our talks:
- “Feature Selection in Authorship Attribution: Ordering the Wordlist” by Maciej Eder and Joanna Byszuk.
- “Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)” by Jeremi K. Ochab, Joanna Byszuk, Steffen Pielström (Universität Würzburg) and Maciej Eder.
Our other contributions to DH 2019 are:
- “Challenging Stylometry: The Authorship of the Baroque Play La Segunda Celestina” by Laura Hernandez Lorenzo and Joanna Byszuk.
- “How To Detect Coup d’État 800 Years Later” by Jan Škvrňák (Department of History, Masaryk University in Brno), Michael Škvrňák (Department of Sociology, Charles University in Prague), and Jeremi K. Ochab.
- “The European Literary Text Collection (ELTeC)” by Caroline Odebrecht, Lou Burnard, Borja Navarro Colorado, Maciej Eder and Christof Schöch.
And our contributions to other conferences:
- “Attribution of Authorship for Medieval Persian Quasidas with Stylometry” by Joanna Byszuk and Alexey Khismatulin (Institute of Oriental Manuscripts, Russian Academy of Sciences), #Right2Left Workshop, 8 VI 2019, Victoria BC.
- “Stylometry of literary papyri” by Jeremi K. Ochab and Holger Essler (Institut für Klassische Philologie, Universität Würzburg), talk at DATeCH.
- “Stylometry of literary and documentary papyri” by Jeremi K. Ochab and Holger Essler (Institut für Klassische Philologie, Universität Würzburg), poster at the International Congress on Papyrology.
Next stages - 2020 update
In collaboration with the COST Action “Distant Reading” (or, to be precise, with Stefano Sbalchiero from the University of Padova, who visited us for a few weeks under the Action’s STSM scheme) we tackled topic modeling methodology in order to test the optimal number of topics to be inferred from a corpus containing very long texts. Our paper “Topic modeling, long texts and the best number of topics: some problems and solutions” by Stefano Sbalchiero and Maciej Eder has just been accepted for publication in the journal Quality & Quantity.