Boosting word frequencies
in authorship attribution
14.12.2022
CHR2022 conference, Antwerp, 12–14.12.2022
stylometry
- measures stylistic differences between texts
- oftentimes aimed at authorship attribution
- relies on stylistic fingerprint, …
- … aka measurable linguistic features
- frequencies of function words
- frequencies of grammatical patterns, etc.
- proves successful in several applications
areas of improvement
- classification method
- distance-based
- svm, nsc, knn, …
- neural networks
- …
- feature engineering
- dimension reduction
- lasso
- …
- feature choice
- MFWs
- POS n-grams
- character n-grams
- …
simple normalization
Occurrences of the most frequent words (MFWs):
##
## the and to i of a in was her it you he she that not my
## 4571 4748 3536 4130 2224 2326 1484 1127 1551 1391 1895 2138 1338 1250 937 1106
Relative frequencies:
##
## the and to i of a in was her it you
## 0.0383 0.0398 0.0296 0.0346 0.0186 0.0195 0.0124 0.0094 0.0130 0.0116 0.0159
relative frequencies
The number of occurrences of a given word divided by the total number
of words:
\[ f_\mathrm{the} = \frac{n_\mathrm{the}}{
n_\mathrm{the} + n_\mathrm{of} + n_\mathrm{and} + n_\mathrm{in} + ... }
\]
In a generalized version:
\[ f_{w} = \frac{n_{w}}{N} \]
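The above normalization can be sketched in a few lines of Python (a minimal illustration, assuming a plain list of tokens):

```python
from collections import Counter

def relative_frequencies(tokens):
    """f_w = n_w / N: each word's count divided by the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

tokens = ["the", "cat", "sat", "on", "the", "mat"]
freqs = relative_frequencies(tokens)
# freqs["the"] == 2/6
```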
relative frequencies
- routinely used
- reliable
- simple
- intuitive
- conceptually elegant
synonyms
Proportions within synonym groups might betray a stylistic
signal:
- on and upon
- drink and beverage
- buy and purchase
- big and large
- et and atque and ac
proportions within synonyms
The proportion of on to upon:
\[ f_\mathrm{on} = \frac{n_\mathrm{on}}{
n_\mathrm{on} + n_\mathrm{upon} } \]
The proportion of upon to on:
\[ f_\mathrm{upon} =
\frac{n_\mathrm{upon}}{ n_\mathrm{on} + n_\mathrm{upon} } \]
By construction, they sum to 1.
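The same computation as a Python sketch (the counts are invented for illustration):

```python
def synonym_proportion(counts, word, group):
    """Frequency of `word` relative to its synonym group only."""
    return counts[word] / sum(counts[w] for w in group)

counts = {"on": 180, "upon": 20}          # hypothetical counts
f_on = synonym_proportion(counts, "on", ["on", "upon"])
f_upon = synonym_proportion(counts, "upon", ["on", "upon"])
# f_on + f_upon == 1 by construction
```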
‘on’/total vs. ‘on’/(‘upon’ + ‘on’)

‘the’/total vs. ‘the’/(‘of’ + ‘the’)

limitations of synonyms
- in many cases, several synonyms
- cf. et and atque and ac in Latin
- in many cases, no synonyms at all
- target words might belong to different grammatical categories
- what are the synonyms for function words?
- provisional conclusion:
- synonyms are but a subset of the words that matter
semantic similarity
- target words: synonyms and more
- e.g. for the word make the target words can involve:
- perform, do, accomplish, finish,
reach, produce, …
- all their inflected forms (if applicable)
- derivative words: nouns, adjectives, e.g. a deed
- the size of a target semantic area is unknown
word vector models
- trained on a large amount of textual data
- capable of capturing (fuzzy) semantic relations between words
- many implementations:
- word2vec
- GloVe
- fastText
- …
GloVe model: examples
the neighbors of house:
## house where place room town houses farm rooms the left
## 1.000 0.745 0.688 0.685 0.664 0.651 0.647 0.638 0.637 0.635
the neighbors of home:
## home return come coming going london back go went came
## 1.000 0.732 0.717 0.705 0.696 0.689 0.682 0.670 0.666 0.665
the neighbors of buy:
## buy sell wanted want sold get send wants give money
## 1.000 0.728 0.552 0.539 0.537 0.536 0.535 0.531 0.526 0.510
the neighbors of style:
## style quality fashion manner type taste manners language
## 1.000 0.597 0.565 0.560 0.547 0.527 0.518 0.512
## proper english
## 0.504 0.504
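Neighbor lists like these boil down to ranking words by cosine similarity over their vectors; a toy Python sketch (the 2-d vectors are invented for illustration, not real GloVe values):

```python
import numpy as np

def nearest_neighbors(word, vectors, k):
    """Rank all words by cosine similarity to `word`; the word itself ranks first."""
    v = vectors[word]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in vectors.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

vectors = {                      # invented 2-d vectors for illustration
    "house": np.array([1.0, 0.1]),
    "place": np.array([0.9, 0.2]),
    "town":  np.array([0.8, 0.3]),
    "buy":   np.array([0.1, 1.0]),
}
nearest_neighbors("house", vectors, k=3)  # → ["house", "place", "town"]
```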
relative frequencies revisited
for a semantic space of 2 neighbors, the frequency of the word
house:
\[ f_\mathrm{house} =
\frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} +
n_\mathrm{place} } \]
for a semantic space of 5 neighbors, the frequency of the word
house:
\[ f_\mathrm{house} =
\frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} +
n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses}
} \]
for a semantic space of 7 neighbors, the frequency of the word
house:
\[ f_\mathrm{house} =
\frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} +
n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses}
+ n_\mathrm{farm} + n_\mathrm{rooms} } \]
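The boosted denominator generalizes to any neighbor list; a Python sketch (counts invented for illustration):

```python
def boosted_frequency(counts, word, neighbors):
    """n_word divided by the summed counts of the word and its semantic neighbors."""
    denom = counts.get(word, 0) + sum(counts.get(w, 0) for w in neighbors)
    return counts.get(word, 0) / denom

counts = {"house": 50, "where": 30, "place": 20, "room": 10, "town": 40}
boosted_frequency(counts, "house", ["where", "place"])   # 50 / (50+30+20) = 0.5
```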
experimental setup
- a corpus of 99 novels in English
- by 33 authors (3 texts per author)
- tokenized and classified with the package stylo
- stratified cross-validation scenario
- 100 cross-validation folds
- distance-based classification performed
- F1 scores reported
distance measures used
- classic Burrows’s Delta
- Cosine Delta (Würzburg)
- Eder’s Delta
- raw Manhattan distance
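Two of these measures sketched in minimal Python (an illustration, not the stylo implementation): Burrows's Delta is the mean absolute difference of z-scored MFW frequencies, and Cosine Delta replaces that with the cosine distance between the same z-score vectors; `means` and `stds` stand for the corpus-wide mean and standard deviation per word, with all numbers below invented.

```python
import numpy as np

def burrows_delta(freqs_a, freqs_b, means, stds):
    """Classic Burrows's Delta: mean absolute difference of z-scored
    word frequencies across the chosen MFWs."""
    za = (freqs_a - means) / stds
    zb = (freqs_b - means) / stds
    return float(np.mean(np.abs(za - zb)))

def cosine_delta(freqs_a, freqs_b, means, stds):
    """Cosine Delta: cosine distance between the same z-score vectors."""
    za = (freqs_a - means) / stds
    zb = (freqs_b - means) / stds
    cos = np.dot(za, zb) / (np.linalg.norm(za) * np.linalg.norm(zb))
    return float(1.0 - cos)

# hypothetical relative frequencies and corpus statistics for 3 MFWs
means = np.array([0.035, 0.030, 0.020])
stds = np.array([0.005, 0.004, 0.003])
a = np.array([0.040, 0.028, 0.021])
b = np.array([0.033, 0.031, 0.019])
burrows_delta(a, b, means, stds), cosine_delta(a, b, means, stds)
```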
results for Cosine Delta

results for Burrows’s Delta

results for Eder’s Delta

results for Manhattan Distance

the best F1 scores
- Cosine Delta: 0.96
- Burrows’s Delta: 0.84
- Eder’s Delta: 0.83
- raw Manhattan: 0.77
how good are the results?
- we know that Cosine Delta outperforms classic Burrows’s Delta, etc.
- what is the actual gain in performance, then?
- an additional round of tests performed to get baseline
- the gain above the baseline reported below
gain for Cosine Delta

gain for Burrows’s Delta

gain for Eder’s Delta

gain for Manhattan Distance

conclusions
- in each scenario, the gain was considerable
- the hot spot of performance varied depending on the method…
- … yet it was spread between 5 and 100 semantic neighbors
- the best classifiers become even better: up to 12% improvement!
conclusions (cont.)
- the new method is very simple
- it doesn’t require any NLP tooling…
- … except getting a general list of n semantic neighbors for
MFWs
- such a list can be generated once and re-used several times
- if a rough method of tracing the words that matter already proves successful, a bigger gain can be expected from more sophisticated language models
alt semantic space
- perhaps the n closest neighbors are not the best way to define semantic spaces
- therefore: testing all the words within a cosine distance of x from the reference word
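The threshold variant sketched in Python, phrased here via a minimum similarity (equivalent to a maximum cosine distance, since distance = 1 − similarity); the toy 2-d vectors are invented for illustration:

```python
import numpy as np

def semantic_space(word, vectors, min_sim):
    """All words whose cosine similarity to `word` is at least `min_sim`,
    instead of a fixed number n of nearest neighbors."""
    v = vectors[word]
    space = []
    for w, u in vectors.items():
        sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim >= min_sim:
            space.append(w)
    return space

vectors = {                      # invented 2-d vectors for illustration
    "house": np.array([1.0, 0.1]),
    "place": np.array([0.9, 0.2]),
    "town":  np.array([0.8, 0.3]),
    "buy":   np.array([0.1, 1.0]),
}
semantic_space("house", vectors, min_sim=0.98)  # → ["house", "place"]
```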
results for cosine similarities

gain for Cosine Delta

results for delta similarities

gain for Burrows’s Delta

results for Eder’s similarities

gain for Eder’s Delta
