Manhattan, Euclidean,
and their Siblings

Exploring Exotic Similarity Measures in Text Classification

Maciej Eder

Polish Academy of Sciences

Jeremi Ochab

Jagiellonian University

08/08/2024

introduction

Why text analysis?

  • Authorship attribution
  • Forensic linguistics
  • Register analysis
  • Genre recognition
  • Gender differences
  • Translatorial signal
  • Early vs. mature style
  • Style evolution
  • Detecting dementia

How to compare a collection of texts?

  • Extracting valuable (i.e. countable) language features from texts
    • frequencies of words 👈
    • frequencies of syllables
    • versification patterns
    • distribution of topics
  • Comparing these features by means of multivariate analysis
    • distance-based methods 👈
    • neural networks

From words to features

‘It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.’
(J. Austen, Pride and Prejudice)

“the” = 4.25%

“in” = 3.45%

“of” = 1.81%

“to” = 1.44%

“a” = 1.37%

“was” = 1.17%

. . .
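As a minimal illustration (not the actual pipeline), such relative frequencies can be obtained in a few lines of base R; note that the percentages above presumably refer to the whole novel rather than to the quoted sentence alone:

text <- "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

# lowercase, strip punctuation, split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))

# relative frequencies (in %), sorted from the most frequent word down
freqs <- sort(table(words), decreasing = TRUE) / length(words) * 100
round(head(freqs), 2)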

From features to similarities

                       the   and    to    of     a   was     I    in
coben_breaker        3.592 1.175 2.163 1.376 2.519 1.502 1.445 1.176
coben_dropshot       3.588 1.179 2.122 1.269 2.375 1.567 1.497 1.040
coben_fadeaway       3.931 1.445 2.200 1.213 2.306 1.323 1.330 1.198
coben_falsemove      3.625 1.613 2.134 1.237 2.401 1.375 1.346 1.109
coben_goneforgood    3.834 1.817 2.153 1.176 1.962 1.733 3.814 1.131
coben_nosecondchance 4.098 1.589 2.271 1.206 1.992 1.758 3.855 1.151
coben_tellnoone      4.102 1.790 2.031 1.246 2.176 1.418 3.499 1.162
galbraith_cuckoos    4.523 2.267 2.494 2.179 2.141 1.656 1.127 1.380
lewis_battle         5.051 3.405 2.138 2.138 1.960 1.511 0.902 1.284
lewis_caspian        4.865 3.592 2.153 2.144 2.168 1.353 1.115 1.212
lewis_chair          4.973 3.221 1.997 2.103 2.354 1.405 1.073 1.214
lewis_horse          4.885 3.487 2.306 2.224 2.322 1.403 1.195 1.298
lewis_lion           5.141 3.699 2.295 2.185 2.100 1.346 0.813 1.162
lewis_nephew         4.482 2.856 2.070 2.231 2.311 1.571 1.179 1.355
lewis_voyage         5.222 3.279 2.261 2.114 2.244 1.583 1.048 1.153
rowling_casual       4.749 2.639 2.625 2.108 1.763 1.646 0.561 1.443
rowling_chamber      4.415 2.344 2.352 1.877 2.001 1.481 0.882 1.168
rowling_goblet       4.483 2.426 2.486 2.022 1.791 1.423 0.849 1.117
rowling_hallows      4.696 2.473 2.244 1.870 1.449 1.126 0.525 0.995

What we hope to get
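Presumably, what we hope to get is a picture in which texts written by the same author end up closest to one another. A minimal sketch of one common way to produce such a picture, assuming freq_table is a matrix like the one on the previous slide (rows = texts, columns = relative frequencies of the most frequent words):

# L1 (Manhattan) distances on z-scored frequencies, then hierarchical clustering
distances <- dist(scale(freq_table), method = "manhattan")
clusters  <- hclust(distances, method = "ward.D2")
plot(clusters, cex = 0.7)   # ideally, same-author texts share a branch of the tree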

theory

What is a distance?

take any two texts:

                the   and    to    of     a   was     I    in    he  said   you
lewis_lion    5.141 3.699 2.295 2.185 2.100 1.346 0.813 1.162 1.087 1.426 1.141
tolkien_lord1 5.624 3.782 2.074 2.597 1.916 1.313 1.492 1.419 1.221 0.825 0.872

subtract the values vertically:

          the    and    to     of     a   was      I     in     he  said   you
       -0.483 -0.083 0.221 -0.412 0.184 0.033 -0.679 -0.257 -0.134 0.601 0.269

then drop the minus signs (i.e. take the absolute values):

                the   and    to    of     a   was     I    in    he  said   you
              0.483 0.083 0.221 0.412 0.184 0.033 0.679 0.257 0.134 0.601 0.269

sum up the obtained values:

[1] 3.356
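In base R, the same three steps collapse into a single expression (the vectors are the two rows shown above):

lewis_lion    <- c(5.141, 3.699, 2.295, 2.185, 2.100, 1.346, 0.813, 1.162, 1.087, 1.426, 1.141)
tolkien_lord1 <- c(5.624, 3.782, 2.074, 2.597, 1.916, 1.313, 1.492, 1.419, 1.221, 0.825, 0.872)

# subtract, drop the minuses (absolute value), sum up
sum(abs(lewis_lion - tolkien_lord1))
# [1] 3.356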

Manhattan vs. Euclidean

Euclidean distance

between any two texts represented by two points A and B in an n-dimensional space can be defined as:

\[ \delta_{AB} = \sqrt{ \sum_{i = 1}^{n} (A_i - B_i)^2 } \]

where A and B are the two documents to be compared, and \(A_i,\, B_i\) are the scaled (z-scored) frequencies of the i-th word in the range of n most frequent words.

Manhattan distance

can be formalized as follows:

\[ \delta_{AB} = \sum_{i = 1}^{n} | A_i - B_i | \]

which is equivalent to

\[ \delta_{AB} = \sqrt[1]{ \sum_{i = 1}^{n} | A_i - B_i |^1 } \]

(the above weird notation will soon become useful)

Euclidean and Manhattan are siblings!

\[ \delta_{AB} = \sqrt[2]{ \sum_{i = 1}^{n} (A_i - B_i)^2 } \]

vs.

\[ \delta_{AB} = \sqrt[1]{ \sum_{i = 1}^{n} | A_i - B_i |^1 } \]

For that reason, Manhattan and Euclidean are named L1 and L2, respectively.

An (infinite) family of distances

  • The above observations can be further generalized
  • Both Manhattan and Euclidean belong to a family of (possible) distances:

\[ \delta_{AB} = \sqrt[p]{ \sum_{i = 1}^{n} | A_i - B_i |^p } \]

where p is both the power and the degree of the root.
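A minimal sketch of this general formula in R, reusing the two vectors from the worked example above (dist_lp is just an illustrative name; for whole frequency tables, base R's dist() offers the same family via method = "minkowski" and its p argument):

dist_lp <- function(a, b, p) {
  sum(abs(a - b)^p)^(1/p)
}

dist_lp(lewis_lion, tolkien_lord1, p = 1)    # Manhattan (L1): 3.356
dist_lp(lewis_lion, tolkien_lord1, p = 2)    # Euclidean (L2)
dist_lp(lewis_lion, tolkien_lord1, p = 0.5)  # an 'exotic' dissimilarity with p < 1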

The norms L1, L2, L3, … (and beyond)

  • The power p doesn’t need to be a natural number
  • We can easily imagine norms such as L1.01, L3.14159, L1¾, L\(\sqrt{2}\) etc.
  • Mathematically, \(p < 1\) doesn’t satisfy the formal definition of a norm…
  • … yet still, one can easily imagine a dissimilarity L0.5 or L0.0001.
  • (plus, the so-called Cosine Distance doesn’t satisfy the definition either).

To summarize…

  • The p parameter is a continuum
  • Both \(p = 1\) and \(p = 2\) (for Manhattan and Euclidean, respectively) are but two specific points in this continuous space
  • p is a method’s hyperparameter to be set or possibly tuned

research question

🧐 How do the norms from a wide range beyond L1 and L2 affect text classification?

experiment

Data

Four full-text datasets used:

  • 99 English novels by 33 authors,
  • 99 Polish novels by 33 authors,
  • 28 books by 8 American Southern authors:
    • Harper Lee, Truman Capote, William Faulkner, Ellen Glasgow, Carson McCullers, Flannery O’Connor, William Styron and Eudora Welty,
  • 26 books by 5 fantasy authors:
    • J.K. Rowling, Robert Galbraith, Harlan Coben, C.S. Lewis, and J.R.R. Tolkien.

Method

  • A supervised classification experiment was designed
  • Aimed at authorship attribution
  • Leave-one-out cross-validation scenario
    • 100 independent bootstrap iterations…
    • … each of them involving 50% randomly selected input features (most frequent words)
    • The procedure was repeated for vectors of 100, 200, 300, …, 1000 most frequent words.
  • The whole experiment was repeated iteratively for L0.1, L0.2, …, L10.
  • The performance in each iteration was evaluated using accuracy, recall, precision, and the F1 score (a simplified sketch of the procedure follows below).
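A heavily simplified sketch of one cell of this design (a single p and a single MFW range), assuming freq_table is a matrix of z-scored word frequencies, authors is the vector of class labels, and dist_lp() is the helper sketched earlier; it scores plain nearest-neighbour attribution and reports accuracy only, whereas the actual code (linked at the end) also computes recall, precision, and F1:

evaluate_p <- function(freq_table, authors, p, mfw = 1000, iterations = 100) {
  accuracy <- numeric(iterations)
  for (i in seq_len(iterations)) {
    # each iteration draws 50% of the `mfw` most frequent words at random
    available <- seq_len(min(mfw, ncol(freq_table)))
    x <- freq_table[, sample(available, size = round(length(available) / 2))]

    # pairwise Lp distances between all texts
    n <- nrow(x)
    d <- matrix(0, n, n)
    for (a in seq_len(n))
      for (b in seq_len(n))
        d[a, b] <- dist_lp(x[a, ], x[b, ], p)

    diag(d) <- Inf                                 # leave-one-out: ignore self-distances
    predicted <- authors[apply(d, 1, which.min)]   # the nearest remaining text wins
    accuracy[i] <- mean(predicted == authors)
  }
  mean(accuracy)
}

# sweep the hyperparameter, as in the experiment: L0.1, L0.2, ..., L10
# results <- sapply(seq(0.1, 10, by = 0.1), function(p) evaluate_p(freq_table, authors, p))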

results

99 English novels by 33 authors

99 Polish novels by 33 authors

28 novels by 8 Southern authors

26 novels by 5 fantasy writers

conclusions

A few observations

  • Metrics with lower \(p\) generally outperform higher-order norms.
  • Specifically, Manhattan is better than Euclidean…
  • … but values \(p < 1\) are even better.
  • Feature vectors that yield the best results (here: long vectors of most frequent words) are the most sensitive to the choice of the distance measure.

Plausible explanations

  • Small \(p\) makes it more important for two feature vectors to differ on fewer features (rather than to show smaller differences spread across many features),
  • Small \(p\) amplifies small differences, which matters, e.g., for low-frequency features: it distinguishes a zero difference (both texts lack the feature) from a small non-zero one (see the toy example below).

Therefore:

  • Small \(p\) norms might be one way of effectively utilizing long feature vectors.
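A toy numeric illustration of both effects, using made-up difference vectors rather than corpus data:

lp <- function(d, p) sum(abs(d)^p)^(1/p)

one_big    <- c(1, 0, 0)         # the two texts differ strongly on a single feature
many_small <- c(1/3, 1/3, 1/3)   # the same total (L1) difference, spread over three features

lp(one_big, 2)        # 1.000
lp(many_small, 2)     # 0.577  ->  under L2, many small differences look closer
lp(one_big, 0.5)      # 1.000
lp(many_small, 0.5)   # 3.000  ->  under L0.5, many small differences look farther

# and, relative to a large difference, a tiny one (e.g. a rare word present
# in only one of the two texts) contributes far more when p is small:
0.01^0.5   # 0.1
0.01^2     # 1e-04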

Questions?

Sources of the texts: the ELTeC corpus and the stylo package.

The code and the datasets: https://github.com/computationalstylistics/beyond_Manhattan

appendix

Lp distances vs. Cosine Delta

English novels:

 mfw   Manhattan   best Lp            Cosine Delta
 100     0.625     0.625 (p = 0.9)       0.666
 500     0.814     0.823 (p = 0.6)       0.865
1000     0.833     0.871 (p = 0.3)       0.892

Polish novels:

 mfw   Manhattan   best Lp            Cosine Delta
 100     0.655     0.659 (p = 0.8)       0.684
 500     0.760     0.769 (p = 0.6)       0.840
1000     0.751     0.835 (p = 0.1)       0.842

99 English novels by 33 authors

99 Polish novels by 33 authors

Triangle inequality broken
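A minimal numeric check, using the general formula with three artificial points (not corpus data) and p = 0.5:

\[ A = (0, 0), \quad B = (1, 1), \quad C = (1, 0) \]

\[ \delta_{AC} = \left( 1^{0.5} + 0^{0.5} \right)^{2} = 1, \qquad \delta_{CB} = \left( 0^{0.5} + 1^{0.5} \right)^{2} = 1, \qquad \delta_{AB} = \left( 1^{0.5} + 1^{0.5} \right)^{2} = 4 \]

\[ \delta_{AB} = 4 > 2 = \delta_{AC} + \delta_{CB} \]

so the detour through C comes out “shorter” than the direct route: for \(p < 1\) the triangle inequality fails, and the measure is a dissimilarity rather than a proper metric.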