This is a three-year collaborative research project between group members from the University of Antwerp and the Institute of Polish Language PAS, funded by the Research Foundation of Flanders (FWO) and the Polish Academy of Sciences (PAS). The collaboration aims to introduce and adapt deep learning methods to computational stylistics, with an emphasis on author identification, and to further integrate the Kraków and Antwerp computational stylistics communities.

More about the project

In the proposed collaboration, we turn our attention to “deep” representation learning in order to improve computational methods for the robust stylistic analysis of short documents (< 1000 words). Deep learning is a specific form of machine learning in artificial intelligence that relies on learning architectures called neural networks. These networks are customizable functions that can learn to transform input data (e.g. an image) into a desired output (e.g. the identity of the person shown). Although these networks require substantial computing power and very large sets of examples (i.e. training data used to optimize them), the technology has led to significant breakthroughs across many domains in the past decade, such as audio engineering, computer vision and natural language processing. The technology is now also emerging in Humanities research, yet surprisingly few applications have been reported so far in the domain of authorship attribution. The little research that has been published in this domain focuses on micro-blogging data (e.g. Twitter posts) and is hard to extrapolate to longer documents. We hypothesize that the lack of research in this direction is caused by (a) the lack of larger training data sets in the field and (b) a general lack of expertise in neural networks.
Seminal research by Bagnall (2016) suggests that character-level language modelling is a feasible approach to stylistic analysis. This method was submitted to a recent competition, and its excellent performance suggests that it is relatively robust in the cross-domain setting. However promising, the method has not been described explicitly, and it has not been followed up in the field since its initial application. In this collaboration we aim to reconstruct, improve and stress-test this method, and so improve the performance of methods for the analysis of short documents (< 1000 words). The project will be organized around the annual cycle of the PAN lab, a shared task on authorship attribution that is co-located each year with the CLEF conference, with the teams meeting for joint workshops twice a year and arranging longer exchanges of early career researchers. By participating in the lab’s competition each year, the team will be able to monitor the improvement of the system regularly and objectively.
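
To make the idea of character-level language modelling for attribution concrete, here is a minimal sketch. It is not a reconstruction of Bagnall's method, which has not been described in detail: instead of a neural network it uses a simple character n-gram model with add-one smoothing, and all names (CharNgramModel, attribute) and the sample texts are hypothetical and invented for illustration. One model is trained per candidate author, and a disputed text is assigned to the author whose model gives it the lowest cross-entropy.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram language model with add-one (Laplace) smoothing.

    A deliberately simple stand-in for a neural character-level model.
    """

    def __init__(self, order=3):
        self.order = order
        self.ngrams = Counter()    # counts of context followed by next character
        self.contexts = Counter()  # counts of the context alone
        self.vocab = set()         # characters seen in training

    def _pad(self, text):
        # Prefix with start symbols so the first characters have a context.
        return "\x02" * (self.order - 1) + text

    def fit(self, text):
        padded = self._pad(text)
        self.vocab.update(text)
        k = self.order - 1
        for i in range(k, len(padded)):
            context = padded[i - k:i]
            self.ngrams[context + padded[i]] += 1
            self.contexts[context] += 1
        return self

    def cross_entropy(self, text):
        """Average negative log2-probability per character, in bits."""
        padded = self._pad(text)
        k = self.order - 1
        v = len(self.vocab) + 1  # one extra slot for characters unseen in training
        total = 0.0
        for i in range(k, len(padded)):
            context = padded[i - k:i]
            numerator = self.ngrams[context + padded[i]] + 1
            denominator = self.contexts[context] + v
            total -= math.log2(numerator / denominator)
        return total / max(len(text), 1)


def attribute(unknown, models):
    """Return the best-scoring author and the per-author cross-entropies."""
    scores = {author: m.cross_entropy(unknown) for author, m in models.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy training texts, invented for this example only.
    models = {
        "author_a": CharNgramModel().fit("the rain fell softly on the old town " * 20),
        "author_b": CharNgramModel().fit("profits rose sharply; markets rallied! " * 20),
    }
    best, scores = attribute("the rain fell on the town", models)
    print(best, scores)
```

In a realistic setting the n-gram model would be replaced by a trained neural language model, and the raw cross-entropies would typically need to be normalized before they can be compared across authors with different amounts of training text.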

Team

Coordinators:

  • Mike Kestemont - University of Antwerp
  • Maciej Eder - Institute of Polish Language PAS

Flemish side:

  • Walter Daelemans – professor in computational linguistics, research director of the Computational Linguistics and Psycholinguistics Research Center (CLiPS)
  • Dirk van Hulle – professor in English literature, director of the Center for Manuscript Genetics, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
  • María del Rocío Ortuño Casanova – doctor-assistant in Spanish literature, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
  • Enrique Manjavacas – PhD student, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
  • Jeroen de Gussem – PhD student, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
  • Wouter Haverals – PhD student, Institute for the Study of Literature in the Low Countries

Polish side:

  • Rafał L. Górski - associate professor in linguistics, head of the Methodology Department, Institute of Polish Language PAS
  • Michał Woźniak - postdoctoral researcher in linguistics, Institute of Polish Language PAS
  • Joanna Byszuk - PhD candidate, Institute of Polish Language PAS
  • Albert Leśniak - PhD candidate, Institute of Polish Language PAS
  • Wojciech Łukasik - PhD candidate, Institute of Polish Language PAS

Project development

This section will offer updates on activities within the project.

August 2019 - Wouter Haverals visits the Institute of Polish Language PAS.

The first study visit of our project took place on 5-16 August: Wouter Haverals from the University of Antwerp spent two weeks with the Kraków team, working on his project examining the possibility of using a text’s rhythmical pattern as a feature for authorship attribution. The specific case of interest was the works assumed to be authored by Henry of Brussels; more details of the project can be found in Wouter’s repository. The visit was inspiring and productive for all involved, and crucial to strengthening the relationship between the two project teams.

November 2019 - Joanna Byszuk visits the University of Antwerp, first project workshop.

At the beginning of November (4-15), Joanna Byszuk from the Institute of Polish Language PAS spent two weeks with the Antwerp team, working on her project on multimodal audiovisual stylometry and engaging with the work on audio classification solutions developed at CLiPS.
Joanna’s visit was followed by a workshop involving both teams, also held in Antwerp, from 25 to 28 November. Its goals were to build teams dedicated to specific subtasks of the project and to launch their work. We split into two (overlapping) groups: one working on multilingual direct speech recognition (as part of our support for the COST Action Distant Reading, in which some of us participate) and one on cross-domain authorship attribution.
As work continues, we hope to present the results of all these efforts to the world fairly soon.

March 2020 - Direct Speech project update

Although the coronavirus outbreak delayed our next meeting, we have successfully continued developing our Direct Speech project remotely. In February, during the annual COST Action meeting, we held a workshop detailing early progress. Our paper was also accepted for the LT4HALA workshop at the LREC conference. Although the conference will not take place, its proceedings will be published. You can read the paper preprint here or the final version (pp. 100-104) here.

The project is now described in detail on a separate page.