Boosting word frequencies in authorship attribution

Maciej Eder (maciej.eder@ijp.pan.pl)

14.12.2022


CHR2022 conference
Antwerp, 12–14.12.2022

introduction

stylometry

  • measures stylistic differences between texts
  • oftentimes aimed at authorship attribution
  • relies on a stylistic fingerprint, …
  • … aka measurable linguistic features
    • frequencies of function words
    • frequencies of grammatical patterns, etc.
  • proves successful in several applications

areas of improvement

  • classification method
    • distance-based
    • SVM, NSC, kNN, …
    • neural networks
  • feature engineering
    • dimension reduction
    • lasso
  • feature choice
    • MFWs
    • POS n-grams
    • character n-grams

relative frequencies

simple normalization

Occurrences of the most frequent words (MFWs):

## 
##  the  and   to    i   of    a   in  was  her   it  you   he  she that  not   my 
## 4571 4748 3536 4130 2224 2326 1484 1127 1551 1391 1895 2138 1338 1250  937 1106

Relative frequencies:

## 
##    the    and     to      i     of      a     in    was    her     it    you 
## 0.0383 0.0398 0.0296 0.0346 0.0186 0.0195 0.0124 0.0094 0.0130 0.0116 0.0159

relative frequencies

The number of occurrences of a given word divided by the total number of words:

\[ f_\mathrm{the} = \frac{n_\mathrm{the}}{ n_\mathrm{the} + n_\mathrm{of} + n_\mathrm{and} + n_\mathrm{in} + ... } \]

In a generalized version:

\[ f_{w} = \frac{n_{w}}{N} \]
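
A minimal sketch of this computation in R, with a toy token vector standing in for a real corpus:

tokens <- c("the", "cat", "sat", "on", "the", "mat")  # toy stand-in for a tokenized text
counts <- table(tokens)                               # n_w for every word type
rel_freqs <- counts / length(tokens)                  # f_w = n_w / N
sort(rel_freqs, decreasing = TRUE)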

relative frequencies

  • routinely used
  • reliable
  • simple
  • intuitive
  • conceptually elegant

words that matter

synonyms

Proportions within synonym groups might betray a stylistic signal:

  • on and upon
  • drink and beverage
  • buy and purchase
  • big and large
  • et and atque and ac

proportions within synonyms

The proportion of on to upon:

\[ f_\mathrm{on} = \frac{n_\mathrm{on}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

The proportion of upon to on:

\[ f_\mathrm{upon} = \frac{n_\mathrm{upon}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

By definition, the two proportions sum to 1.
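
A quick R illustration, with hypothetical counts:

n_on <- 3021; n_upon <- 412           # hypothetical counts
f_on   <- n_on   / (n_on + n_upon)    # proportion of 'on'
f_upon <- n_upon / (n_on + n_upon)    # proportion of 'upon'
f_on + f_upon                         # equals 1 by construction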

[Figure: ‘on’/total vs. ‘on’/(‘upon’ + ‘on’)]

[Figure: ‘the’/total vs. ‘the’/(‘of’ + ‘the’)]

limitations of synonyms

  • in many cases, several synonyms
    • cf. et and atque and ac in Latin
  • in many cases, no synonyms at all
  • target words might belong to different grammatical categories
  • what are the synonyms for function words?
  • provisional conclusion:
    • synonyms are but a subset of the words that matter

beyond synonyms

semantic similarity

  • target words: synonyms and more
  • e.g. for the word make, the target words can include:
    • perform, do, accomplish, finish, reach, produce, …
    • all their inflected forms (if applicable)
    • derivative words: nouns, adjectives, e.g. a deed
  • the size of a target semantic area is unknown

word vector models

  • trained on a large amount of textual data
  • capable of capturing (fuzzy) semantic relations between words
  • many implementations:
    • word2vec
    • GloVe
    • fastText
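
A minimal sketch of how such neighbor lists can be retrieved, assuming vectors is a words-by-dimensions matrix of pre-trained embeddings (e.g. GloVe); the function below is illustrative, not code from the original study:

nearest_neighbors <- function(word, vectors, n = 10) {
    v <- vectors[word, ]
    sims <- as.vector(vectors %*% v) /
            (sqrt(rowSums(vectors^2)) * sqrt(sum(v^2)))  # cosine similarity
    names(sims) <- rownames(vectors)
    round(sort(sims, decreasing = TRUE)[1:n], 3)
}
# e.g. nearest_neighbors("house", vectors)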

GloVe model: examples

the neighbors of house:

##  house  where  place   room   town houses   farm  rooms    the   left 
##  1.000  0.745  0.688  0.685  0.664  0.651  0.647  0.638  0.637  0.635

the neighbors of home:

##   home return   come coming  going london   back     go   went   came 
##  1.000  0.732  0.717  0.705  0.696  0.689  0.682  0.670  0.666  0.665

the neighbors of buy:

##    buy   sell wanted   want   sold    get   send  wants   give  money 
##  1.000  0.728  0.552  0.539  0.537  0.536  0.535  0.531  0.526  0.510

the neighbors of style:

##    style  quality  fashion   manner     type    taste  manners language 
##    1.000    0.597    0.565    0.560    0.547    0.527    0.518    0.512 
##   proper  english 
##    0.504    0.504

relative frequencies revisited

for a semantic space of the 2 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} } \]

for a semantic space of the 5 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses} } \]

for a semantic space of the 7 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses} + n_\mathrm{farm} + n_\mathrm{rooms} } \]
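
In general, for the k nearest semantic neighbors; a sketch assuming counts is a named vector of raw word counts and neighbors maps each word to its ranked neighbor list (e.g. obtained as sketched earlier):

boosted_freq <- function(word, counts, neighbors, k = 5) {
    space <- c(word, neighbors[[word]][1:k])         # the word plus its k neighbors
    counts[word] / sum(counts[space], na.rm = TRUE)  # n_w over the semantic space
}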

will it fly?

experimental setup

  • a corpus of 99 novels in English
  • by 33 authors (3 texts per author)
  • tokenized and classified by the package stylo
    • stratified cross-validation scenario
    • 100 cross-validation folds
    • distance-based classification performed
    • F1 scores reported
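
In stylo, the classification step might look roughly as follows; the parameter values are assumptions for illustration, not the exact settings of the experiments:

library(stylo)
results <- classify(gui = FALSE,
                    training.corpus.dir = "primary_set",
                    test.corpus.dir = "secondary_set",
                    mfw.min = 100, mfw.max = 1000,
                    distance.measure = "dist.wurzburg",  # Cosine Delta
                    cv.folds = 100,
                    classification.method = "delta")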

distance measures used

  • classic Burrows’s Delta
  • Cosine Delta (Würzburg)
  • Eder’s Delta
  • raw Manhattan distance
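
For reference, classic Burrows's Delta is the mean absolute difference of z-scored word frequencies; a sketch of that definition (my formulation, not stylo's internal code):

# f1, f2: relative frequencies of two texts over the same MFWs;
# mu, sigma: per-word means and standard deviations across the corpus
burrows_delta <- function(f1, f2, mu, sigma) {
    mean(abs((f1 - mu) / sigma - (f2 - mu) / sigma))
}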

results

[Figure: results for Cosine Delta]

[Figure: results for Burrows’s Delta]

[Figure: results for Eder’s Delta]

[Figure: results for Manhattan Distance]

the best F1 scores

  • Cosine Delta: 0.96
  • Burrows’s Delta: 0.84
  • Eder’s Delta: 0.83
  • raw Manhattan: 0.77
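
For completeness, a sketch of how a macro-averaged F1 score can be computed from a confusion matrix (rows = true authors, columns = predictions); a standard formulation, not code from the study:

f1_macro <- function(cm) {
    precision <- diag(cm) / colSums(cm)
    recall    <- diag(cm) / rowSums(cm)
    f1 <- 2 * precision * recall / (precision + recall)
    mean(f1, na.rm = TRUE)  # macro-average over authors
}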

how good are the results?

  • we know that Cosine Delta outperforms Classic Delta etc.
  • what is the actual gain in performance, then?
  • an additional round of tests performed to establish a baseline
  • the gain above the baseline reported below

[Figure: gain for Cosine Delta]

[Figure: gain for Burrows’s Delta]

[Figure: gain for Eder’s Delta]

[Figure: gain for Manhattan Distance]

conclusions

  • in each scenario, the gain was considerable
  • the hot spot of performance varied depending on the method…
  • … yet it was spread between 5 and 100 semantic neighbors
  • the best classifiers are even better: up to 12% improvement!

conclusions (cont.)

  • the new method is very simple
  • it doesn’t require any NLP tooling…
  • … except getting a general list of n semantic neighbors for MFWs
  • such a list can be generated once and re-used several times
  • since even a rough method of tracing the words that matter proved successful, a bigger gain can be expected from sophisticated language models

Thank you!

mail: maciej.eder@ijp.pan.pl

twitter: @MaciejEder

GitHub: https://github.com/computationalstylistics/word_frequencies

appendix

alternative semantic space

  • perhaps taking the n closest neighbors is not the best way to define semantic spaces
  • therefore: testing all words within a cosine distance of x from the reference word (sketched below)
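
Under this alternative definition, the semantic space collects all words whose cosine distance to the reference word does not exceed x; a sketch, again assuming a pre-trained vectors matrix of embeddings:

semantic_space <- function(word, vectors, x = 0.4) {
    v <- vectors[word, ]
    sims <- as.vector(vectors %*% v) /
            (sqrt(rowSums(vectors^2)) * sqrt(sum(v^2)))
    names(sims) <- rownames(vectors)
    names(sims[sims >= 1 - x])  # cosine distance = 1 - cosine similarity
}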

[Figure: results for cosine similarities]

[Figure: gain for Cosine Delta]

[Figure: results for delta similarities]

[Figure: gain for Burrows’s Delta]

[Figure: results for Eder’s similarities]

[Figure: gain for Eder’s Delta]