Using the function stylo() in batch mode
Introduction
The stylometric package stylo is known mostly for its main function, namely stylo(), a do-it-all-at-once piece of software meant to be run in interactive mode. Sooner or later, advanced users discover other functions such as classify(), oppose() etc., not to mention the lower-level functions for text preprocessing. Conveniently, all these functions can serve as building blocks to design your own experiment, which might be run on a server in batch mode (by running a script rather than via the GUI). Some of these functionalities are covered in previous posts.
A natural question is whether the main function stylo() exhibits similar properties: can it be invoked from inside a script and run on a server in batch mode? The short answer is yes, for two reasons. Firstly, any R function can be evaluated in such a way that whatever it returns is assigned to a variable, e.g.:
my_new_variable = stylo()
From now on, the object my_new_variable will store the output of the function stylo() (no plots, though). Secondly, the function stylo() can suppress its GUI by means of a dedicated argument. Consequently, all the other parameters of the model can be passed as function arguments, e.g.:
another_variable = stylo(gui = FALSE, mfw.min = 200, mfw.max = 200)
The above call will read input texts from files, and then perform a default analysis (hierarchical clustering, Classic Delta similarity measure) with the 200 most frequent words as stylometric features. By adding further arguments, one can set the linkage method and the visualization flavor (or silence the visualizations altogether), as sketched below; one can even use an existing table of word frequencies instead of loading the input texts from files.
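For instance, a more customized call might look as follows. This is a sketch only: the arguments analysis.type, linkage and write.png.file come from the package documentation rather than from the code above, so consult help(stylo) to confirm them in your version of the package.
my_results = stylo(gui = FALSE,
                   mfw.min = 200, mfw.max = 200,  # the 200 most frequent words
                   analysis.type = "CA",          # hierarchical cluster analysis
                   linkage = "ward.D",            # the linkage method
                   distance.measure = "delta",    # Classic Delta
                   display.on.screen = FALSE,     # silence the plot window...
                   write.png.file = TRUE)         # ...and save the plot as a PNG instead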
To test the applicability of the function stylo() in batch mode, let's attack a relatively simple problem which is, however, rarely reported in the literature. Namely, we will address the question of the extent to which the stylometric distance between two texts is stable. Certainly, several studies have shown that the final results depend on the choice of features – the picture for the 100 most frequent words is different than that for the 1,000 most frequent words – but there is still very little awareness of what this really means, statistically.
In the following sections we will gradually build a script to perform a real-life experiment testing the similarities between different texts in a corpus. We will start with a simple scenario involving just two texts: we will examine the variability of the distance between them given different input features.
What is the distance between two given texts, really?
The idea is simple. From a pool of available features (here: most frequent words), we randomly select, say, 100 of them, and compare the two input texts using the function stylo(). We repeat the procedure many times, recording the resulting distance in each run, and finally examine the statistical properties of the obtained distances.
To randomly select n elements from the pool of features, we can use the generic function sample(). E.g. if we want to draw 10 elements from the range 1:100, we type:
sample(1:100, 10)
## [1] 71 5 27 94 44 92 85 74 9 100
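Note that sample() draws at random, so your ten numbers will differ from the ones shown above. If a given run of an experiment should be reproducible, fix R's random seed beforehand (123 being an arbitrary choice):
set.seed(123)
sample(1:100, 10)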
Now, the goal is to put all the pieces together. We aim to perform 100 iterations, and in each iteration to estimate the distance between two texts from the lee dataset. As an example, we'll use the 1st and the 2nd text, i.e. In Cold Blood and Breakfast at Tiffany's, both by Truman Capote. We run the loop 100 times, and in each turn we invoke the function stylo() with the frequencies of 100 randomly selected features. Once the loop is complete, we collect the results.
The dataset lee is provided by the package stylo. It is a matrix with word frequencies of the 3,000 MFWs for 28 American novels. Type help(lee) to see what's inside.
library(stylo)
data(lee)
final.results = c()
for(iteration in 1:100) {
    # randomly pick 100 numbers from the range 1:3000
    pick.features = sample(1:3000, 100, replace = FALSE)
    # select the features matching the 100 random numbers
    dataset = lee[ , pick.features]
    # perform the unsupervised analysis, using Classic Delta
    results = stylo(frequencies = dataset, gui = FALSE, display.on.screen = FALSE)
    # pick the distance stored in the 1st row and 2nd column,
    # i.e. the one between two different texts by Capote
    final.distance = results$distance.table[1,2]
    # add the new result to the already existing vector
    final.results = c(final.results, final.distance)
}
The collected 100 distances are stored in the variable final.results. Basic statistics include the mean and the standard deviation of all the distances:
mean(final.results)
## [1] 0.8665739
sd(final.results)
## [1] 0.09405702
The arithmetic mean is often considered a good proxy for the actual distance between two texts. However, the spread between the shortest and the longest distance obtained for the same pair of texts might be really disappointing:
min(final.results)
## [1] 0.6417331
max(final.results)
## [1] 1.123667
Quite a spread, huh? A better insight into the results can be gained by estimating the distribution of the 100 computed distances. A simple density plot will support the interpretation:
distances = density(final.results)
plot(distances, main = "Two novels by Capote")
Take the results obtained so far as a caveat: the distance between a given pair of texts might vary quite a lot, depending on the choice of input features. It will also depend on the number of features tested at a time. Try to edit the above code so that in each iteration 1,000 random features are drawn instead of 100 (edit the line that invokes the function sample(), as shown below). How will the observed distribution change?
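The tweak boils down to changing one argument of sample():
# draw 1,000 random features per iteration instead of 100
pick.features = sample(1:3000, 1000, replace = FALSE)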
The distribution of distances in a corpus
Now, what if we generalize the above code to cover all the texts in the corpus? The idea is to examine all pairwise distances between available texts, in order to see how wide the final distribution might be. The code discussed in the previous section needs but a minor tweak to achieve the goal:
library(stylo)
data(lee)
final.results = c()
for(iteration in 1:100) {
    pick.features = sample(1:3000, 100)
    dataset = lee[ , pick.features]
    results = stylo(frequencies = dataset, gui = FALSE, display.on.screen = FALSE)
    final.distance = as.vector(results$distance.table)
    final.results = c(final.results, final.distance)
}
# since the values contain self-similarity tests as well,
# here's a dirty trick to get rid of 0 values from the results:
final.results = final.results[final.results > 0]
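As a side note, the vector also contains every pairwise distance twice, since the distance matrix is symmetric; this doubles the data points without changing the shape of the distribution. A cleaner alternative, assuming results$distance.table is a full square matrix (as it is here), is to keep only its lower triangle, which drops the zeros and the duplicates in one go:
# inside the loop, instead of as.vector(results$distance.table):
final.distance = results$distance.table[lower.tri(results$distance.table)]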
Again, we plot the distribution:
distances = density(final.results)
plot(distances, main = "Distances between 28 novels")
Quite striking is the observation that we are dealing here with a normal (or Gaussian) distribution, while in fact the pairwise comparisons involved (i) pairs of texts by the same author and (ii) pairs of texts by different authors. We should expect the distances in the first group to be smaller and those in the second group larger and, above all, we should expect to see a bi-modal distribution (i.e. one with two humps) rather than a Gaussian one. The next section will address this issue.
Putting it all together
The final version of the script will resemble a typical authorship verification procedure. In short, here we will split the distances into the ones generated by texts of the same authorship (ingroup distribution), and texts of different authorship (outgroup distribution). For the sake of convenience, the preamble contains the settings to be adjusted in a real-life series of tests. Here’s the code:
library(stylo)
data(lee)
####### adjust the parameters #############
# pick a distance: "delta", "wurzburg", ...
preferred.distance = "wurzburg"
# choose a number of randomly picked features
random.features = 100
###########################################
final.results.same = c()
final.results.different = c()
for(iteration in 1:100) {
    pick.features = sample(1:3000, random.features)
    dataset = lee[ , pick.features]
    results = stylo(frequencies = dataset, gui = FALSE,
                    display.on.screen = FALSE,
                    mfw.min = 3000, mfw.max = 3000,
                    distance.measure = preferred.distance)
    final.distance = results$distance.table
    # the class of a text is its author, i.e. the file name
    # with everything after the first underscore stripped
    classes = gsub("_.*", "", rownames(final.distance))
    for(n in 1:length(classes)) {
        same = classes[n] == classes
        different = classes[n] != classes
        # collect the distances inside the loop, so that every text
        # (not only the last one) contributes to the final results
        final.results.same = c(final.results.same, final.distance[n, same])
        final.results.different = c(final.results.different, final.distance[n, different])
    }
}
# get rid of the 0 values produced by the self-similarity tests
final.results.same = final.results.same[final.results.same > 0]
# estimating the density function
density.same = density(final.results.same)
density.different = density(final.results.different)
We have completed a very solid test. (Seriously, it's a real, fully-fledged setup that could be used in an actual publication.) It deserves a plot somewhat more sophisticated than those presented in the previous sections. I encourage you to analyze the following code yourself; nonetheless, worth mentioning are: the semi-transparent colors set in the first lines (fancy.green, fancy.red), determining the range of the plot (max.y.value, max.x.value), calling an empty scatterplot (plot(NULL)), adding colored areas representing the two distributions (polygon()), and finally adding a legend to the plot (legend()). Here it is:
# plotting the outgroup and the ingroup distances
fancy.green = rgb(0, 1, 0, 0.3)
fancy.red = rgb(1, 0, 0, 0.3)
max.y.value = max(c(density.same$y, density.different$y))
max.x.value = max(c(density.same$x, density.different$x))
plot(NULL, ylim = c(0, max.y.value), xlim = c(0, max.x.value),
     ylab = "density", xlab = "distance")
polygon(density.same, col = fancy.green)
polygon(density.different, col = fancy.red)
legend("topleft", legend = c("ingroup distances", "outgroup distances"),
       lty = 1, lwd = 10, col = c(fancy.green, fancy.red), bty = "n")
Again, try to replicate the test with different settings, e.g. different distance measures or a different number of features. As you will see, some parameter choices make the two distributions clearly distinct, whereas others barely differentiate the subsets. Try to determine the optimal set of parameters, and take a moment to think about why some factors play a role in making the authorial signals distinct.
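As a starting point, you might re-run the final script with the following preamble, swapping the "wurzburg" measure (Cosine Delta) for Classic Delta and drawing 1,000 features per iteration:
####### adjust the parameters #############
# Classic Delta instead of Cosine Delta ("wurzburg")
preferred.distance = "delta"
# 1,000 randomly picked features per iteration
random.features = 1000
###########################################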