Cross-validation using the function classify()
This post assumes that the reader is familiar with supervised machine-learning classification methods and their main advantage, namely the ability to assess the quality of the trained model. This can be accomplished via cross-validation, i.e. a number of swaps between the training and the testing sets. There are several great introductions to the fascinating world of machine-learning (cross-validation being covered in almost all of them), including tons of materials on the internet. I personally love the book on statistics with R by James and his colleagues (2013). The following sections will be focused on two functions provided by the R package stylo.
Performing cross-validation is relatively straightforward using the function classify(), without any manual swaps between the two sets. You define your primary_set and the secondary_set, and then you invoke the function indicating the number of cross-validation folds:
classify(cv.folds = 10)
or, if you want to have access to particular cv folds:
# perform the classification:
results = classify(cv.folds = 10)
# get the classification accuracy:
results$cross.validation.summary
This will give you stratified cross-validation, i.e. the variant that reproduces the representation of classes from your training_set in N random iterations.
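For the record, a slightly fuller call might look as follows. Keep in mind this is just a sketch: the directory names are the package defaults, the remaining settings are purely illustrative and should be adapted to your own corpus, and the exact shape of cross.validation.summary may differ between stylo versions (when in doubt, inspect it with str()):
library(stylo)
# gui = FALSE suppresses the graphical dialog box;
# the remaining settings are illustrative only:
results = classify(gui = FALSE,
                   training.corpus.dir = "primary_set",
                   test.corpus.dir = "secondary_set",
                   mfw.min = 100, mfw.max = 100,
                   classification.method = "delta",
                   cv.folds = 10)
# per-fold accuracy scores:
results$cross.validation.summary
# and, assuming the above holds numeric accuracy values,
# a single overall figure averaged over the folds:
mean(results$cross.validation.summary)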
Now, there is a function crossv() that is meant to replace some core fragments of classify() in the future. I am not there yet, though. So far, it is not fully functional. To perform leave-one-out cross-validation, you prepare the training_set only, and put your stuff there. Then you have to load the corpus and prepare a document-term matrix. Let's assume you've already got it:
library(stylo)
data(galbraith)
Type help(galbraith) to see what the matrix contains. Then you type:
crossv(training.set = galbraith, cv.mode = "leaveoneout", classification.method = "svm")
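Apart from the help page, you can also peek at the matrix directly with a few lines of base R (nothing stylo-specific is involved here; the table is assumed to have texts in rows and word frequencies in columns, as produced by make.table.of.frequencies()):
# the dimensions of the table: texts (rows) by word frequencies (columns)
dim(galbraith)
# the names of the texts:
rownames(galbraith)
# a small top-left corner of the table:
galbraith[1:5, 1:5]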
To build the document-term matrix, some more steps have to be undertaken beforehand:
library(stylo)
# loading the corpus
texts = load.corpus.and.parse(files = "all", corpus.dir = "corpus")
# getting a general frequency list
freq.list = make.frequency.list(texts, head = 1000)
# preparing the document-term matrix:
word.frequencies = make.table.of.frequencies(corpus = texts, features = freq.list)
# now the main procedure takes place:
crossv(training.set = word.frequencies, cv.mode = "leaveoneout", classification.method = "svm")
Needless to say, it is wise to store the results of your cross-validation procedure in a variable rather than letting them fly away; hence, the above code should be slightly refined:
# the same as above but saved to a variable:
results = crossv(training.set = word.frequencies, cv.mode = "leaveoneout", classification.method = "svm")
By assigning the results to a variable, you can further assess the distribution of the accuracy scores across the cross-validation folds, which tells you quite a lot about your corpus. There is a study (Eder and Rybicki, 2013) comparing different corpora and their distributions of authorship attribution accuracy scores under intense cross-validation scenarios.
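A minimal sketch of such an assessment is given below. It assumes that the object returned by crossv() is (or can be coerced to) a numeric vector of per-fold performance scores; since crossv() is not fully functional yet and its output may still change, check the structure with str() first:
# inspect the structure of the object returned by crossv():
str(results)
# assuming it can be coerced to a numeric vector of per-fold scores:
scores = as.numeric(unlist(results))
# basic distributional statistics:
summary(scores)
# and a quick visual check:
hist(scores, main = "cross-validation accuracy", xlab = "accuracy")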
References
Eder, M. and Rybicki, J. (2013). Do birds of a feather really flock together, or how to choose training samples for authorship attribution. Literary and Linguistic Computing, 28(2): 229-36, [pre-print].
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.