Boosting word frequencies in authorship attribution

Maciej Eder (maciej.eder@ijp.pan.pl)

14.12.2022


CHR2022 conference
Antwerp, 12–14.12.2022

introduction

stylometry

  • measures stylistic differences between texts
  • oftentimes aimed at authorship attribution
  • relies on a stylistic fingerprint, …
  • … aka measurable linguistic features
    • frequencies of function words
    • frequencies of grammatical patterns, etc.
  • proves successful in several applications

areas of improvement

  • classification method
    • distance-based
    • SVM, NSC, kNN, …
    • neural networks
  • feature engineering
    • dimension reduction
    • lasso
  • feature choice
    • MFWs
    • POS n-grams
    • character n-grams

relative frequencies

simple normalization

Occurrences of the most frequent words (MFWs):

## 
##  the  and   to    i   of    a   in  was  her   it  you   he  she that  not   my 
## 4571 4748 3536 4130 2224 2326 1484 1127 1551 1391 1895 2138 1338 1250  937 1106

Relative frequencies:

## 
##    the    and     to      i     of      a     in    was    her     it    you 
## 0.0383 0.0398 0.0296 0.0346 0.0186 0.0195 0.0124 0.0094 0.0130 0.0116 0.0159

relative frequencies

The number of occurrences of a given word divided by the total number of words:

\[ f_\mathrm{the} = \frac{n_\mathrm{the}}{ n_\mathrm{the} + n_\mathrm{of} + n_\mathrm{and} + n_\mathrm{in} + ... } \]

In a generalized version:

\[ f_{w} = \frac{n_{w}}{N} \]
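
A minimal sketch of this computation in R, with a toy token vector standing in for a real corpus:

tokens <- c("the", "cat", "sat", "on", "the", "mat")  # toy stand-in for a tokenized text
counts <- table(tokens)                               # n_w for every word type
rel_freqs <- counts / length(tokens)                  # f_w = n_w / N
sort(rel_freqs, decreasing = TRUE)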

relative frequencies

  • routinely used
  • reliable
  • simple
  • intuitive
  • conceptually elegant

words that matter

synonyms

Proportions within synonym groups might betray a stylistic signal:

  • on and upon
  • drink and beverage
  • buy and purchase
  • big and large
  • et and atque and ac

proportions within synonyms

The proportion of on to upon:

\[ f_\mathrm{on} = \frac{n_\mathrm{on}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

The proportion of upon to on:

\[ f_\mathrm{upon} = \frac{n_\mathrm{upon}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

By definition, the two proportions sum to 1.
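
A quick R illustration, with hypothetical counts:

n_on <- 3021; n_upon <- 412           # hypothetical counts
f_on   <- n_on   / (n_on + n_upon)    # proportion of 'on'
f_upon <- n_upon / (n_on + n_upon)    # proportion of 'upon'
f_on + f_upon                         # equals 1 by construction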

[Figure: ‘on’/total vs. ‘on’/(‘upon’ + ‘on’)]

[Figure: ‘the’/total vs. ‘the’/(‘of’ + ‘the’)]

limitations of synonyms

  • in many cases, several synonyms
    • cf. et and atque and ac in Latin
  • in many cases, no synonyms at all
  • target words might belong to different grammatical categories
  • what are the synonyms for function words?
  • provisional conclusion:
    • synonyms are but a subset of the words that matter

beyond synonyms

semantic similarity

  • target words: synonyms and more
  • e.g. for the word make, the target words can include:
    • perform, do, accomplish, finish, reach, produce, …
    • all their inflected forms (if applicable)
    • derivative words: nouns, adjectives, e.g. a deed
  • the size of a target semantic area is unknown

word vector models

  • trained on a large amount of textual data
  • capable of capturing (fuzzy) semantic relations between words
  • many implementations:
    • word2vec
    • GloVe
    • fastText
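
A minimal sketch of how such neighbor lists can be retrieved, assuming vectors is a words-by-dimensions matrix of pre-trained embeddings (e.g. GloVe); the function below is illustrative, not code from the original study:

nearest_neighbors <- function(word, vectors, n = 10) {
    v <- vectors[word, ]
    sims <- as.vector(vectors %*% v) /
            (sqrt(rowSums(vectors^2)) * sqrt(sum(v^2)))  # cosine similarity
    names(sims) <- rownames(vectors)
    round(sort(sims, decreasing = TRUE)[1:n], 3)
}
# e.g. nearest_neighbors("house", vectors)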

GloVe model: examples

the neighbors of house:

##  house  where  place   room   town houses   farm  rooms    the   left 
##  1.000  0.745  0.688  0.685  0.664  0.651  0.647  0.638  0.637  0.635

the neighbors of home:

##   home return   come coming  going london   back     go   went   came 
##  1.000  0.732  0.717  0.705  0.696  0.689  0.682  0.670  0.666  0.665

the neighbors of buy:

##    buy   sell wanted   want   sold    get   send  wants   give  money 
##  1.000  0.728  0.552  0.539  0.537  0.536  0.535  0.531  0.526  0.510

the neighbors of style:

##    style  quality  fashion   manner     type    taste  manners language 
##    1.000    0.597    0.565    0.560    0.547    0.527    0.518    0.512 
##   proper  english 
##    0.504    0.504

relative frequencies revisited

for a semantic space of the 2 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} } \]

for a semantic space of the 5 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses} } \]

for a semantic space of the 7 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses} + n_\mathrm{farm} + n_\mathrm{rooms} } \]
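
In general, for the k nearest semantic neighbors; a sketch assuming counts is a named vector of raw word counts and neighbors maps each word to its ranked neighbor list (e.g. obtained as sketched earlier):

boosted_freq <- function(word, counts, neighbors, k = 5) {
    space <- c(word, neighbors[[word]][1:k])         # the word plus its k neighbors
    counts[word] / sum(counts[space], na.rm = TRUE)  # n_w over the semantic space
}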

will it fly?

experimental setup

  • a corpus of 99 novels in English
  • by 33 authors (3 texts per author)
  • tokenized and classified by the package stylo
    • stratified cross-validation scenario
    • 100 cross-validation folds
    • distance-based classification performed
    • F1 scores reported
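
In stylo, the classification step might look roughly as follows; the parameter values are assumptions for illustration, not the exact settings of the experiments:

library(stylo)
results <- classify(gui = FALSE,
                    training.corpus.dir = "primary_set",
                    test.corpus.dir = "secondary_set",
                    mfw.min = 100, mfw.max = 1000,
                    distance.measure = "dist.wurzburg",  # Cosine Delta
                    cv.folds = 100,
                    classification.method = "delta")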

distance measures used

  • classic Burrows’s Delta
  • Cosine Delta (Würzburg)
  • Eder’s Delta
  • raw Manhattan distance
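
For reference, classic Burrows's Delta is the mean absolute difference of z-scored word frequencies; a sketch of that definition (my formulation, not stylo's internal code):

# f1, f2: relative frequencies of two texts over the same MFWs;
# mu, sigma: per-word means and standard deviations across the corpus
burrows_delta <- function(f1, f2, mu, sigma) {
    mean(abs((f1 - mu) / sigma - (f2 - mu) / sigma))
}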

results

[Figure: results for Cosine Delta]

[Figure: results for Burrows’s Delta]

[Figure: results for Eder’s Delta]

[Figure: results for Manhattan Distance]

the best F1 scores

  • Cosine Delta: 0.96
  • Burrows’s Delta: 0.84
  • Eder’s Delta: 0.83
  • raw Manhattan: 0.77
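
For completeness, a sketch of how a macro-averaged F1 score can be computed from a confusion matrix (rows = true authors, columns = predictions); a standard formulation, not code from the study:

f1_macro <- function(cm) {
    precision <- diag(cm) / colSums(cm)
    recall    <- diag(cm) / rowSums(cm)
    f1 <- 2 * precision * recall / (precision + recall)
    mean(f1, na.rm = TRUE)  # macro-average over authors
}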

how good are the results?

  • we know that Cosine Delta outperforms Classic Delta etc.
  • what is the actual gain in performance, then?
  • an additional round of tests performed to establish a baseline
  • the gain above the baseline reported below

[Figure: gain for Cosine Delta]

[Figure: gain for Burrows’s Delta]

[Figure: gain for Eder’s Delta]

[Figure: gain for Manhattan Distance]

conclusions

  • in each scenario, the gain was considerable
  • the hot spot of performance varied depending on the method…
  • … yet it was spread between 5 and 100 semantic neighbors
  • the best classifiers are even better: up to 12% improvement!

conclusions (cont.)

  • the new method is very simple
  • it doesn’t require any NLP tooling…
  • … except getting a general list of n semantic neighbors for MFWs
  • such a list can be generated once and re-used several times
  • since even a rough method of tracing the words that matter proved successful, a bigger gain can be expected from sophisticated language models

Thank you!

mail: maciej.eder@ijp.pan.pl

twitter: @MaciejEder

GitHub: https://github.com/computationalstylistics/word_frequencies

appendix

alternative semantic space

  • perhaps taking the n closest neighbors is not the best way to define semantic spaces
  • therefore: testing all words within a cosine distance of x from the reference word (sketched below)
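
Under this alternative definition, the semantic space collects all words whose cosine distance to the reference word does not exceed x; a sketch, again assuming a pre-trained vectors matrix of embeddings:

semantic_space <- function(word, vectors, x = 0.4) {
    v <- vectors[word, ]
    sims <- as.vector(vectors %*% v) /
            (sqrt(rowSums(vectors^2)) * sqrt(sum(v^2)))
    names(sims) <- rownames(vectors)
    names(sims[sims >= 1 - x])  # cosine distance = 1 - cosine similarity
}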

[Figure: results for cosine similarities]

[Figure: gain for Cosine Delta]

[Figure: results for delta similarities]

[Figure: gain for Burrows’s Delta]

[Figure: results for Eder’s similarities]

[Figure: gain for Eder’s Delta]