Can I conduct a stylometric analysis with any language in the world?

In theory – yes. In practice, there are a few things you should consider first, from problems specific to the language (e.g. some languages, like Chinese, work better with character n-grams than with words) to building a sufficiently large, well-balanced and consistently formatted corpus.


Encoding:

While it is generally good practice to use UTF-8 encoding, this is especially important if the language of your corpus uses Latin characters with diacritics, e.g. “ó”, “ö”, “ñ”, “ň”, or characters of entirely different writing systems, e.g. Chinese or Japanese. If you are getting errors despite setting the UTF-8 parameter in the GUI, check whether all the files in your corpus are actually encoded as UTF-8. Text editors differ in where they expose this, but you can usually check the encoding while saving the file or in its format settings.
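
You can also verify the encoding programmatically before running an analysis. Below is a minimal sketch in R; the directory name "corpus" and the use of the stringi package are assumptions for illustration, not part of ‘Stylo’ itself:

library(stringi)

# Flag every corpus file whose raw bytes are not valid UTF-8.
# "corpus" is a hypothetical directory name; adjust to your layout.
for (f in list.files("corpus", full.names = TRUE)) {
  raw.bytes <- readBin(f, what = "raw", n = file.size(f))
  if (!stri_enc_isutf8(raw.bytes)) {
    cat("not valid UTF-8:", f, "\n")
  }
}

Any file reported here can then be re-saved as UTF-8 in your editor of choice.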

Languages in ‘Stylo’ GUI:

As of version 0.6.8, the ‘Stylo’ GUI allows selection between the following languages:
English, Latin (also a variant treating u/v as u), Polish, Hungarian, French, Italian, Spanish, Dutch, German, CJK (Chinese, Japanese, Korean) and Other.

The Other option should work for all languages supported by ‘Stylo’, and usually there is no need to narrow the choice down to a specific language group. The individual languages offered in the GUI mainly enable extra features such as pronoun deletion (available in the Features section; not recommended for most experiments).
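
If you script your analyses instead of clicking through the GUI, the same choice is made with the corpus.lang parameter. A minimal sketch, assuming a recent version of ‘Stylo’ in which "Other" is an accepted value:

library(stylo)

# The GUI language selection corresponds to the corpus.lang parameter;
# "Other" covers any language whose codepoints are implemented.
stylo(gui = FALSE, corpus.lang = "Other")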

If you want to conduct a stylometric analysis on texts in a language we haven’t included so far, you can pass the proper codepoints to the function:

txt.to.words(input.text, splitting.rule = "[^PLACE_YOUR_CODEPOINTS_HERE]+",
    preserve.case = FALSE)

The ^ character in the class defined by the [ ] brackets negates it: you list the characters that you do not want to be treated as delimiters. In other words, all characters that are not explicitly listed in the brackets will be used as word delimiters. You can also use the splitting.rule parameter for other purposes, e.g. when you want to use a separator other than a space (the ‘Stylo’ default).
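
For instance, a small sketch of the latter case, using a hypothetical input in which tokens are joined by underscores rather than spaces:

library(stylo)

# Runs of underscores act as the word delimiter here.
txt.to.words("the_quick_brown_fox", splitting.rule = "[_]+")
# [1] "the"   "quick" "brown" "fox"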

Frankly, in the vast majority of applications you don’t even need to know that the above function exists. You simply pass your codepoints directly to the stylo() function:

stylo(splitting.rule = "[^PLACE_YOUR_CODEPOINTS_HERE]+")   
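
For instance, a sketch for Armenian, a language absent from the list below; the two ranges are the letter ranges of the Unicode Armenian block, and corpus.dir = "corpus" is simply the default folder layout:

library(stylo)

# U+0531-U+0556: Armenian uppercase letters
# U+0561-U+0587: Armenian lowercase letters
# (everything outside these ranges delimits words)
stylo(gui = FALSE,
      corpus.dir = "corpus",
      splitting.rule = "[^\U0531-\U0556\U0561-\U0587]+")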

Language codepoints implemented in ‘Stylo’ ver. 0.6.8

  • Latin supplement (Western): \U00C0-\U00FF
  • Latin supplement (Eastern): \U0100-\U01BF
  • Latin extended (phonetic): \U01C4-\U02AF
  • modern Greek: \U0386\U0388-\U03FF
  • Cyrillic: \U0400-\U0481\U048A-\U0527
  • Hebrew: \U05D0-\U05EA\U05F0-\U05F4
  • Arabic/Farsi: \U0620-\U065F\U066E-\U06D3\U06D5\U06DC
  • extended Latin: \U1E00-\U1EFF
  • ancient Greek: \U1F00-\U1FBC\U1FC2-\U1FCC\U1FD0-\U1FDB\U1FE0-\U1FEC\U1FF2-\U1FFC
  • Coptic: \U03E2-\U03EF\U2C80-\U2CF3
  • Georgian: \U10A0-\U10FF
  • Japanese (Hiragana): \U3040-\U309F
  • Japanese (Katakana): \U30A0-\U30FF
  • Japanese repetition symbols: \U3005\U3031-\U3035
  • CJK Unified Ideographs: \U4E00-\U9FFF
  • CJK Unified Ideographs Extension A: \U3400-\U4DBF
  • Hangul (Korean script): \UAC00-\UD7AF
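
The ranges can also be combined. For example, a sketch for a corpus that mixes modern (monotonic) and ancient (polytonic) Greek orthography, simply concatenating the two Greek ranges listed above:

library(stylo)

# Both Greek blocks from the list above, pasted into one character class.
greek <- paste0("\U0386\U0388-\U03FF",
                "\U1F00-\U1FBC\U1FC2-\U1FCC\U1FD0-\U1FDB",
                "\U1FE0-\U1FEC\U1FF2-\U1FFC")
stylo(gui = FALSE, splitting.rule = paste0("[^", greek, "]+"))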

What are the best settings for my language?

Frequent questions include:

  • are words or characters better features for language X?
  • how many most frequent words are good for tests in language X?
  • how big should the character n-grams be?

Feature selection is a complicated problem in its own right, and so far few studies have compared such settings from a cross-lingual perspective. If possible, (a) look into the literature dealing with your language of choice, and (b) start by testing the settings that are considered general gold standards. You will find some publications on the matter below.
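
In ‘Stylo’ terms, testing such settings means varying the analyzed.features, ngram.size and MFW parameters. A minimal sketch of two common starting points; the concrete numbers are placeholders for experimentation, not recommendations:

library(stylo)

# Baseline 1: the 100 most frequent words.
stylo(gui = FALSE, corpus.dir = "corpus",
      analyzed.features = "w", ngram.size = 1,
      mfw.min = 100, mfw.max = 100)

# Baseline 2: the 500 most frequent character 4-grams.
stylo(gui = FALSE, corpus.dir = "corpus",
      analyzed.features = "c", ngram.size = 4,
      mfw.min = 500, mfw.max = 500)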


Cross-lingual comparisons

Eder, M. (2011). Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics, 6: 99–114.

Rybicki, J. and Eder, M. (2011). Deeper Delta across genres and languages: do we really need the most frequent words? Literary and Linguistic Computing, 26(3): 315–21.

Non-Latin scripts

Du, K. (2016). Testing Delta on Chinese texts. In Digital Humanities 2016: Conference Abstracts. Kraków: Jagiellonian University & Pedagogical University, pp. 781–83.

Sample size standards

Eder, M. (2017). Short samples in authorship attribution: a new approach. In Digital Humanities 2017: Conference Abstracts. Montreal: McGill University, pp. 221–24.

Eder, M. (2010). Does size matter? Authorship attribution, short samples, big problem. In Digital Humanities 2010: Conference Abstracts. London: King’s College London, pp. 132–35.

Feature selection

Kestemont, M. (2014). Function words in authorship attribution. From black magic to theory? In Proceedings of the Third Workshop on Computational Linguistics for Literature. Gothenburg: Association for Computational Linguistics, pp. 59–66.

Sapkota, U., Bethard, S., Montes, M. and Solorio, T. (2015). Not all character n-grams are created equal: a study in authorship attribution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, CO: Association for Computational Linguistics, pp. 93–102.

Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21: 421–39.

You can find studies concerning many other languages in the Stylometry Bibliography; do contribute to it by sharing information about studies that have not yet been included.