Manhattan, Euclidean,
and their Siblings

Exploring Exotic Similarity Measures in Text Classification

Maciej Eder

Polish Academy of Sciences

Jeremi Ochab

Jagiellonian University

08/08/2024

introduction

Why text analysis?

  • Authorship attribution
  • Forensic linguistics
  • Register analysis
  • Genre recognition
  • Gender differences
  • Translatorial signal
  • Early vs. mature style
  • Style evolution
  • Detecting dementia

How to compare a collection of texts?

  • Extracting valuable (i.e. countable) language features from texts
    • frequencies of words 👈
    • frequencies of syllables
    • versification patterns
    • distribution of topics
  • Comparing these features by means of multivariate analysis
    • distance-based methods 👈
    • neural networks

From words to features

‘It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.’
(J. Austen, Pride and Prejudice)

“the” = 4.25%

“in” = 3.45%

“of” = 1.81%

“to” = 1.44%

“a” = 1.37%

“was” = 1.17%

. . .
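As a minimal illustration (not the actual pipeline), such relative frequencies can be obtained in a few lines of base R; note that the percentages above presumably refer to the whole novel rather than to the quoted sentence alone:

text <- "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

# lowercase, strip punctuation, split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))

# relative frequencies (in %), sorted from the most frequent word down
freqs <- sort(table(words), decreasing = TRUE) / length(words) * 100
round(head(freqs), 2)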

From features to similarities

                       the   and    to    of     a   was     I    in
coben_breaker        3.592 1.175 2.163 1.376 2.519 1.502 1.445 1.176
coben_dropshot       3.588 1.179 2.122 1.269 2.375 1.567 1.497 1.040
coben_fadeaway       3.931 1.445 2.200 1.213 2.306 1.323 1.330 1.198
coben_falsemove      3.625 1.613 2.134 1.237 2.401 1.375 1.346 1.109
coben_goneforgood    3.834 1.817 2.153 1.176 1.962 1.733 3.814 1.131
coben_nosecondchance 4.098 1.589 2.271 1.206 1.992 1.758 3.855 1.151
coben_tellnoone      4.102 1.790 2.031 1.246 2.176 1.418 3.499 1.162
galbraith_cuckoos    4.523 2.267 2.494 2.179 2.141 1.656 1.127 1.380
lewis_battle         5.051 3.405 2.138 2.138 1.960 1.511 0.902 1.284
lewis_caspian        4.865 3.592 2.153 2.144 2.168 1.353 1.115 1.212
lewis_chair          4.973 3.221 1.997 2.103 2.354 1.405 1.073 1.214
lewis_horse          4.885 3.487 2.306 2.224 2.322 1.403 1.195 1.298
lewis_lion           5.141 3.699 2.295 2.185 2.100 1.346 0.813 1.162
lewis_nephew         4.482 2.856 2.070 2.231 2.311 1.571 1.179 1.355
lewis_voyage         5.222 3.279 2.261 2.114 2.244 1.583 1.048 1.153
rowling_casual       4.749 2.639 2.625 2.108 1.763 1.646 0.561 1.443
rowling_chamber      4.415 2.344 2.352 1.877 2.001 1.481 0.882 1.168
rowling_goblet       4.483 2.426 2.486 2.022 1.791 1.423 0.849 1.117
rowling_hallows      4.696 2.473 2.244 1.870 1.449 1.126 0.525 0.995

What we hope to get
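Presumably, what we hope to get is a picture in which texts written by the same author end up closest to one another. A minimal sketch of one common way to produce such a picture, assuming freq_table is a matrix like the one on the previous slide (rows = texts, columns = relative frequencies of the most frequent words):

# L1 (Manhattan) distances on z-scored frequencies, then hierarchical clustering
distances <- dist(scale(freq_table), method = "manhattan")
clusters  <- hclust(distances, method = "ward.D2")
plot(clusters, cex = 0.7)   # ideally, same-author texts share a branch of the tree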

theory

What is a distance?

take any two texts:

                the   and    to    of     a   was     I    in    he  said   you
lewis_lion    5.141 3.699 2.295 2.185 2.100 1.346 0.813 1.162 1.087 1.426 1.141
tolkien_lord1 5.624 3.782 2.074 2.597 1.916 1.313 1.492 1.419 1.221 0.825 0.872

subtract the values vertically:

          the    and    to     of     a   was      I     in     he  said   you
       -0.483 -0.083 0.221 -0.412 0.184 0.033 -0.679 -0.257 -0.134 0.601 0.269

then drop the minus signs (i.e. take the absolute values):

                the   and    to    of     a   was     I    in    he  said   you
              0.483 0.083 0.221 0.412 0.184 0.033 0.679 0.257 0.134 0.601 0.269

sum up the obtained values:

[1] 3.356
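In base R, the same three steps collapse into a single expression (the vectors are the two rows shown above):

lewis_lion    <- c(5.141, 3.699, 2.295, 2.185, 2.100, 1.346, 0.813, 1.162, 1.087, 1.426, 1.141)
tolkien_lord1 <- c(5.624, 3.782, 2.074, 2.597, 1.916, 1.313, 1.492, 1.419, 1.221, 0.825, 0.872)

# subtract, drop the minuses (absolute value), sum up
sum(abs(lewis_lion - tolkien_lord1))
# [1] 3.356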

Manhattan vs. Euclidean

Euclidean distance

between any two texts represented by two points A and B in an n-dimensional space can be defined as:

\[ \delta_{AB} = \sqrt{ \sum_{i = 1}^{n} (A_i - B_i)^2 } \]

where A and B are the two documents to be compared, and \(A_i,\, B_i\) are the scaled (z-scored) frequencies of the i-th word in the range of n most frequent words.

Manhattan distance

can be formalized as follows:

\[ \delta_{AB} = \sum_{i = 1}^{n} | A_i - B_i | \]

which is equivalent to

\[ \delta_{AB} = \sqrt[1]{ \sum_{i = 1}^{n} | A_i - B_i |^1 } \]

(the above weird notation will soon become useful)

Euclidean and Manhattan are siblings!

\[ \delta_{AB} = \sqrt[2]{ \sum_{i = 1}^{n} (A_i - B_i)^2 } \]

vs.

\[ \delta_{AB} = \sqrt[1]{ \sum_{i = 1}^{n} | A_i - B_i |^1 } \]

For that reason, Manhattan and Euclidean are named L1 and L2, respectively.

An (infinite) family of distances

  • The above observations can be further generalized
  • Both Manhattan and Euclidean belong to a family of (possible) distances:

\[ \delta_{AB} = \sqrt[p]{ \sum_{i = 1}^{n} | A_i - B_i |^p } \]

where p is both the power and the degree of the root.
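A minimal sketch of this general formula in R, reusing the two vectors from the worked example above (dist_lp is just an illustrative name; for whole frequency tables, base R's dist() offers the same family via method = "minkowski" and its p argument):

dist_lp <- function(a, b, p) {
  sum(abs(a - b)^p)^(1/p)
}

dist_lp(lewis_lion, tolkien_lord1, p = 1)    # Manhattan (L1): 3.356
dist_lp(lewis_lion, tolkien_lord1, p = 2)    # Euclidean (L2)
dist_lp(lewis_lion, tolkien_lord1, p = 0.5)  # an 'exotic' dissimilarity with p < 1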

The norms L1, L2, L3, … (and beyond)

  • The power p doesn’t need to be a natural number
  • We can easily imagine norms such as L1.01, L3.14159, L1¾, L\(\sqrt{2}\) etc.
  • Mathematically, \(p < 1\) doesn’t satisfy the formal definition of a norm…
  • … yet still, one can easily imagine a dissimilarity L0.5 or L0.0001.
  • (plus, the so-called Cosine Distance doesn’t satisfy the definition either).

To summarize…

  • The p parameter is a continuum
  • Both \(p = 1\) and \(p = 2\) (for Manhattan and Euclidean, respectively) are but two specific points in this continuous space
  • p is a method’s hyperparameter to be set or possibly tuned

research question

🧐 How do the norms from a wide range beyond L1 and L2 affect text classification?

experiment

Data

Four full-text datasets used:

  • 99 English novels by 33 authors,
  • 99 Polish novels by 33 authors,
  • 28 books by 8 American Southern authors:
    • Harper Lee, Truman Capote, William Faulkner, Ellen Glasgow, Carson McCullers, Flannery O’Connor, William Styron and Eudora Welty,
  • 26 books by 5 fantasy authors:
    • J.K. Rowling, Robert Galbraith, Harlan Coben, C.S. Lewis, and J.R.R. Tolkien.

Method

  • A supervised classification experiment was designed
  • Aimed at authorship attribution
  • Leave-one-out cross-validation scenario
    • 100 independent bootstrap iterations…
    • … each of them involving 50% randomly selected input features (most frequent words)
    • The procedure was repeated for vectors of 100, 200, 300, …, 1000 most frequent words.
  • The whole experiment was repeated iteratively for L0.1, L0.2, …, L10.
  • The performance in each iteration was evaluated using accuracy, recall, precision, and the F1 score (a simplified sketch of the procedure follows below).
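A heavily simplified sketch of one cell of this design (a single p and a single MFW range), assuming freq_table is a matrix of z-scored word frequencies, authors is the vector of class labels, and dist_lp() is the helper sketched earlier; it scores plain nearest-neighbour attribution and reports accuracy only, whereas the actual code (linked at the end) also computes recall, precision, and F1:

evaluate_p <- function(freq_table, authors, p, mfw = 1000, iterations = 100) {
  accuracy <- numeric(iterations)
  for (i in seq_len(iterations)) {
    # each iteration draws 50% of the `mfw` most frequent words at random
    available <- seq_len(min(mfw, ncol(freq_table)))
    x <- freq_table[, sample(available, size = round(length(available) / 2))]

    # pairwise Lp distances between all texts
    n <- nrow(x)
    d <- matrix(0, n, n)
    for (a in seq_len(n))
      for (b in seq_len(n))
        d[a, b] <- dist_lp(x[a, ], x[b, ], p)

    diag(d) <- Inf                                 # leave-one-out: ignore self-distances
    predicted <- authors[apply(d, 1, which.min)]   # the nearest remaining text wins
    accuracy[i] <- mean(predicted == authors)
  }
  mean(accuracy)
}

# sweep the hyperparameter, as in the experiment: L0.1, L0.2, ..., L10
# results <- sapply(seq(0.1, 10, by = 0.1), function(p) evaluate_p(freq_table, authors, p))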

results

99 English novels by 33 authors

99 Polish novels by 33 authors

28 novels by 8 Southern authors

26 novels by 5 fantasy writers

conclusions

A few observations

  • Metrics with lower \(p\) generally outperform higher-order norms.
  • Specifically, Manhattan is better than Euclidean…
  • … but values \(p < 1\) are even better.
  • Feature vectors that yield the best results (here: long vectors of most frequent words) are the most sensitive to the choice of the distance measure.

Plausible explanations

  • Small \(p\) makes it more important for two feature vectors to differ on fewer features (rather than to show smaller differences spread across many features),
  • Small \(p\) amplifies small differences, which matters, e.g., for low-frequency features: it distinguishes a zero difference (both texts lack the feature) from a small non-zero one (see the toy example below).

Therefore:

  • Small \(p\) norms might be one way of effectively utilizing long feature vectors.
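A toy numeric illustration of both effects, using made-up difference vectors rather than corpus data:

lp <- function(d, p) sum(abs(d)^p)^(1/p)

one_big    <- c(1, 0, 0)         # the two texts differ strongly on a single feature
many_small <- c(1/3, 1/3, 1/3)   # the same total (L1) difference, spread over three features

lp(one_big, 2)        # 1.000
lp(many_small, 2)     # 0.577  ->  under L2, many small differences look closer
lp(one_big, 0.5)      # 1.000
lp(many_small, 0.5)   # 3.000  ->  under L0.5, many small differences look farther

# and, relative to a large difference, a tiny one (e.g. a rare word present
# in only one of the two texts) contributes far more when p is small:
0.01^0.5   # 0.1
0.01^2     # 1e-04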

Questions?

Sources of the texts: the ELTeC corpus and the stylo package.

The code and the datasets: https://github.com/computationalstylistics/beyond_Manhattan

appendix

Lp distances vs. Cosine Delta

English novels:

 mfw   Manhattan   best Lp            Cosine Delta
 100     0.625     0.625 (p = 0.9)       0.666
 500     0.814     0.823 (p = 0.6)       0.865
1000     0.833     0.871 (p = 0.3)       0.892

Polish novels:

 mfw   Manhattan   best Lp            Cosine Delta
 100     0.655     0.659 (p = 0.8)       0.684
 500     0.760     0.769 (p = 0.6)       0.840
1000     0.751     0.835 (p = 0.1)       0.842

99 English novels by 33 authors

99 Polish novels by 33 authors

Triangle inequality broken
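A minimal numeric check, using the general formula with three artificial points (not corpus data) and p = 0.5:

\[ A = (0, 0), \quad B = (1, 1), \quad C = (1, 0) \]

\[ \delta_{AC} = \left( 1^{0.5} + 0^{0.5} \right)^{2} = 1, \qquad \delta_{CB} = \left( 0^{0.5} + 1^{0.5} \right)^{2} = 1, \qquad \delta_{AB} = \left( 1^{0.5} + 1^{0.5} \right)^{2} = 4 \]

\[ \delta_{AB} = 4 > 2 = \delta_{AC} + \delta_{CB} \]

so the detour through C comes out “shorter” than the direct route: for \(p < 1\) the triangle inequality fails, and the measure is a dissimilarity rather than a proper metric.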