This is a three-year collaborative research project between group members from the University of Antwerp and the Institute of Polish Language PAS, funded by the Research Foundation of Flanders (FWO) and the Polish Academy of Sciences (PAS). The collaboration aims to introduce and adapt deep learning methods to computational stylistics, with an emphasis on author identification, and to further integrate the Kraków and Antwerp computational stylistics communities.
More about the project
In the proposed collaboration, we aim to turn our attention to “deep” representation learning in order to improve computational methods for the robust stylistic analysis of short documents (< 1000 words). Deep learning is a specific form of machine learning in artificial intelligence that relies on learning architectures called neural networks. These networks are customizable functions that can learn to transform input data (e.g. an image) into a desired output (e.g. the identity of a person). While these networks require substantial computing power and very large sets of examples (i.e. training data with which to optimize them), this technology has led to significant breakthroughs across many domains in the past decade, such as audio engineering, computer vision and natural language processing. Although this technology is now also emerging in Humanities research, surprisingly few applications have been reported so far in the domain of authorship attribution. The little research that has been published in this domain focuses on micro-blogging data (e.g. Twitter posts) and is hard to extrapolate to longer documents. We hypothesize that the lack of research in this direction is caused by (a) the lack of larger training data sets in the field and (b) a general lack of expertise in neural networks.
Seminal research by Bagnall (2016) suggests that character-level language modelling is a feasible approach to stylistic analysis. This method was submitted to a recent competition, and its excellent performance suggests that it is relatively robust in the cross-domain setting. However promising, the method has not been explicitly described, and since its initial application it has not been followed up in the field. In this collaboration we aim to reconstruct, improve and stress-test this method, and so improve the performance of methods for the analysis of short documents (< 1000 words). The project will be organized around the annual cycle of the PAN lab, a shared task on authorship attribution that is co-located each year with the CLEF conference, with teams meeting for joint workshops twice a year and arranging longer exchanges of early career researchers. By participating in the lab’s competition each year, the team will be able to monitor the improvement of the system regularly and objectively.
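To give a rough idea of what attribution with character-level language models involves, here is a minimal sketch. It is not a reconstruction of Bagnall's system: as an assumption made purely for illustration, it replaces the neural model with a simple character n-gram model using add-one smoothing. One model is trained per candidate author, and a disputed text is assigned to the author whose model encodes it with the lowest cross-entropy. All names (`CharNgramModel`, `attribute`) and the toy texts are hypothetical.

```python
import math
from collections import Counter

PAD = "\x02"  # padding symbol so the first characters also have a context


class CharNgramModel:
    """Character n-gram language model with add-one (Laplace) smoothing."""

    def __init__(self, text, order=3):
        self.order = order
        padded = PAD * (order - 1) + text
        self.vocab = set(padded)
        self.ngrams = Counter()    # counts of (context, next character)
        self.contexts = Counter()  # counts of each context
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            self.ngrams[(context, padded[i])] += 1
            self.contexts[context] += 1

    def cross_entropy(self, text):
        """Average number of bits per character needed to encode `text`."""
        padded = PAD * (self.order - 1) + text
        v = len(self.vocab) + 1  # one extra slot for characters never seen
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            seen = self.ngrams[(context, padded[i])]
            p = (seen + 1) / (self.contexts[context] + v)
            total -= math.log2(p)
        return total / max(len(text), 1)


def attribute(candidates, disputed, order=3):
    """Return the best-scoring author and the per-author cross-entropies."""
    scores = {
        author: CharNgramModel(text, order).cross_entropy(disputed)
        for author, text in candidates.items()
    }
    return min(scores, key=scores.get), scores


# Toy data, invented for this example only.
candidates = {
    "author_a": "the cat sat on the mat and the cat ate the rat " * 5,
    "author_b": "zzzz qqqq xxxx zzzz qqqq xxxx " * 5,
}
disputed = "the cat sat on the mat"

best, scores = attribute(candidates, disputed)
print(best, scores)
```

A lower cross-entropy means the disputed text is less "surprising" to that author's model. A neural character-level model would play the same role as `CharNgramModel` here, but would need far more training data and computing power.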
- Mike Kestemont - University of Antwerp
- Maciej Eder - Institute of Polish Language PAS
- Walter Daelemans - professor in computational linguistics, research director of the Computational Linguistics and Psycholinguistics Research Center (CLiPS)
- Dirk van Hulle - professor in English literature, director of the Center for Manuscript Genetics, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
- María del Rocío Ortuño Casanova - doctor-assistant in Spanish literature, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
- Enrique Manjavacas - PhD student, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
- Jeroen de Gussem - PhD student, Antwerp Centre for Digital Humanities and Literary Criticism (ACDC)
- Wouter Haverals - PhD student, Institute for the Study of Literature in the Low Countries
- Rafał L. Górski - associate professor in linguistics, Head of Methodology Department, Institute of Polish Language PAS
- Michał Woźniak - postdoctoral researcher in linguistics, Institute of Polish Language PAS
- Joanna Byszuk - PhD candidate, Institute of Polish Language PAS
- Albert Leśniak - PhD candidate, Institute of Polish Language PAS
- Wojciech Łukasik - PhD candidate, Institute of Polish Language PAS
This section will offer updates on activities within the project.
August 2019 - Wouter Haverals visits the Institute of Polish Language PAS.
The first study visit of our project took place in early August (5-16): Wouter Haverals from the University of Antwerp spent two weeks with the Kraków team, working on his project examining whether a text’s rhythmical pattern can be used as a feature for authorship attribution. The specific case of interest was the body of works assumed to be authored by Henry of Brussels; more details of the project can be found in Wouter’s repository. The visit was inspiring and productive for all involved, and crucial to strengthening the relationship between the two project teams.