# Custom distance measures

Distance measures in stylometry… This is definitely the topic one never
gets bored with. I’ve been promising myself to write a longer piece on
this one day: not this time, though. In this short post, I’m going to
introduce a functionality of the R package `stylo`

(ver. >= 0.6.0) that
allows for testing *any* distance.

Apart from the already-implemented distance measures (most of them can
be conveniently selected via GUI), the package `stylo`

features a socket
for plugging in your own custom distances. In short, a bit of coding
combined with some expertise in maths – that’s basically all we need. We
have to design a function which takes a table with frequencies as an
input parameter, and returns a square table of distances: it can be
either an object of the generic class `dist`

(the lower triangle of the
distance matrix stored by columns in a vector), or a regular full
matrix, symmetric across the diagonal. Whatever.

Suppose you want to test the distance discussed in a great paper by Jannidis, Schoch, Pielstrom & Vitt (Jannidis et al., 2015) presented at DH2015. This is a regular Cosine Distance applied to z-scored data. Interestingly enough, this measure has been (sort of) introduced in an earlier study (Smith and Aldridge, 2011), but never tested before the Würzburg guys took over. First, we prepare a custom function (this is a simple version: a real function should check if the input dataset can be further processed, if it is a matrix, etc.). Type the following code:

```
my.cosine.distance = function(x){
# z-scoring the input matrix of frequencies
x = scale(x)
# computing cosine dissimilarity
y = as.dist( x %*% t(x) / (sqrt(rowSums(x^2) %*% t(rowSums(x^2)))) )
# then, turning it into cosine similarity
z = 1 - y
# getting the results
return(z)
}
```

Once the above code is copy-pasted to the R console, it becomes a new
function named `my.cosine.distance()`

, and it becomes visible for other
R objects. The function, however, is not persistent: the paste-copying
step has to be repeated every time a new R session launched.

Having completed the above step, we’re all set. Now, one can use the
tailored distance function with any of the main functions of the package
`stylo`

. Note the following examples:

```
stylo(distance.measure = "my.cosine.distance")
classify(distance.measure = "my.cosine.distance")
rolling.classify(distance.measure = "my.cosine.distance")
```

The Cosine Delta (aka Würzburg Delta) is but one example of replacing the original kernel of the Delta method with a custom distance. What about trying something else? Assume that you plan to test if the Entropy Distance outperforms other similarity measures. It has been reported in the literature (Juola and Baayen, 2005) that entropy-based distances are generally accurate in stylometry. Let’s define a tailored function:

```
dist.entropy = function(x) {
A = t(t(x + 1) / colSums(x + 1))
B = t(t(log(x + 2)) / -(colSums(A * log(A))))
y = dist(B, method="manhattan")
return(y)
}
```

The next step is rather obvious, given the already-discussed examples:

```
stylo(distance.measure = "dist.entropy")
```

Now, what about an inverse correlation distance? It has been successfully applied in a cross-language benchmark study distances in stylometry (Forsyth and Sharoff, 2014), in which, by the way, a few other interesting measures have been tested. The following code was contributed by Richard Forsyth:

```
## Additional distance function for Stylo() :
## submitted by R.S. Forsyth.
## First version : 30/10/2013
## Last revision : 30/10/2013
cordist = function (freqvec,cols=89,usemeth="spearman") {
## takes 2D vector of word frex [,1:cols] (table.with.all.freqs);
## returns a distance matrx 2b used in clustering or MDS.
## Distance index is inverse correlation (Rank.Corr default).
if (cols < 2) stop("Too few cols!!")
## table.with.all.freqs usually has too many cols;
## don't know where correct number to be used is saved;
## therefore currently needed as function argument.
tranfrex = t(freqvec[,1:cols]) ## columns are features (words/grams)
dmat = 1-cor(tranfrex,meth=usemeth)
## meth cd also be "kendall" or "pearson" (not recommended).
return (as.dist(dmat))
## names seem to stay attached, as attr(*,"Labels").
} ## ready for hclust() or sammon() scaling.
```

Etc. etc. etc. It is claimed that over 5,000 distance measures have been introduced so far in exact sciences (Moisl, 2014). Try them all!

## References

**Forsyth, R. and Sharoff, S.** (2014). Document dissimilarity within
and across languages: A benchmarking study. *Literary and Linguistic
Computing*, **29**(1): 6–22
doi:10.1093/llc/fqt002 (accessed
31 May 2018).

**Jannidis, F., Pielström, S., Schöch, C. and Vitt, T.** (2015).
Improving Burrows’ Delta – An empirical evaluation of text distance
measures. In, *Digital Humanities 2015: Conference Abstracts*. Sydney,
Australia: University of Western Sydney http://dh2015.org/abstracts.

**Juola, P. and Baayen, H.** (2005). A controlled-corpus experiment in
authorship attribution by cross-entropy. *Literary and Linguistic
Computing*, **20**(1): 59–67.

**Moisl, H.** (2014). *Cluster Analysis for Corpus Linguistics*. Berlin:
Mouton de Gruyter.

**Smith, P. W. H. and Aldridge, W.** (2011). Improving authorship
attribution: Optimizing Burrows’ Delta method. *Journal of Quantitative
Linguistics*, **18**(1): 63–88
doi:10.1080/09296174.2011.533591
(accessed 22 April 2018).