Words that (might) matter, or keywords extraction methods

Maciej Eder (maciej.eder@ijp.pan.pl)

27.02.2023


Using and Developing Software for Keyness Analysis
University of Trier, 27–28.02.2023

Motivation

  • Information retrieval:
    • How to “read” a big collection of documents, e.g. an archive?
  • Linguistics:
    • What is the underlying model for defining word meaning?

Does an ideal keyword exist?

  • information retrieval:
    • “meaningful” words
    • what a given text is about
  • stylometry:
    • is the stylistic layer also important?
    • if a given author/genre overuses the word “the”, should that be captured? neglected?

Basic assumptions

  • Words are unevenly distributed across any two text collections.
  • The distribution of a word will always differ between text corpora.
  • The real problem is to identify differences that are significant.
  • Keyword = a word significantly more frequent in a given text.

A simple idea…

I have just returned from a visit to my landlord – the solitary neighbour that I shall be troubled with.

neighbour solitary troubled landlord visit returned just shall from I have be with my that a to the

Emily Bronte, Wuthering Heights

heathcliff, linton, catherine, hareton, earnshaw, cathy, edgar, ellen, heights, hindley, nelly, ll, grange, i, wuthering, t, joseph, isabella, master, gimmerton, zillah, m, exclaimed, he, thrushcross, and, answered, yah, kenneth, ve, maister, lockwood, kitchen, you, dean, moors, replied, cried, him, muttered, lintons, papa, she, till, commenced, on, wer, ech, shoo, leant, hearth, bonny, door, stairs, hell, me, crags, moor, wouldn, fiend, settle, jabez, penistone, fire, ye, its, bid, nowt, naught, yer, hush, mistress, grew, lad, compelled, minny, won, hisseln, skulker, soa, wisht, cousin, lattice, didn, yon, minute, lass, needn, inquired, snow, branderham, flaysome, gooid, sud, thear, affirming, interrupted, couldn, window, …, …, …, a, in, the

Frequencies?

Frequencies in two texts

Differences between frequencies

   the      and      to       i        of
 3.557    2.938    3.398    1.696    2.965
 3.919    4.068    2.988    3.078    1.905

The difference between the values:

   the      and      to       i        of
-0.362   -1.130    0.410   -1.382    1.059
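The subtraction above can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: it assumes that the values are relative frequencies in percent, that tokens are lowercase runs of letters, and the function names `relative_frequencies` and `frequency_differences` are made up for this example.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Return each word's frequency as a percentage of all tokens."""
    # Assumption: a token is any run of lowercase ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: 100 * n / total for word, n in counts.items()}

def frequency_differences(text_a, text_b):
    """Rank words by (frequency in A) minus (frequency in B), largest first."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    vocabulary = set(freq_a) | set(freq_b)
    # A word missing from one text counts as frequency 0 there.
    diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocabulary}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

ranking = frequency_differences("the cat and the dog",
                                "a cat and a dog and a bird")
```

Words at the top of `ranking` are overused in the first text, words at the bottom in the second.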

The most distinctive words

##   [1] "of"          "was"         "her"         "elizabeth"   "to"         
##   [6] "be"          "mr"          "had"         "not"         "darcy"      
##  [11] "she"         "very"        "which"       "in"          "all"        
##  [16] "that"        "they"        "bennet"      "their"       "but"        
##  [21] "been"        "such"        "jane"        "bingley"     "could"      
##  [26] "much"        "am"          "so"          "with"        "mrs"        
##  [31] "as"          "were"        "them"        "have"        "for"        
##  [36] "is"          "by"          "sister"      "it"          "wickham"    
##  [41] "what"        "will"        "collins"     "herself"     "most"       
##  [46] "miss"        "soon"        "this"        "lydia"       "who"        
##  [51] "any"         "dear"        "family"      "lady"        "every"      
##  [56] "must"        "know"        "without"     "though"      "more"       
##  [61] "letter"      "lizzy"       "other"       "think"       "longbourn"  
##  [66] "many"        "good"        "gardiner"    "mother"      "do"         
##  [71] "no"          "may"         "great"       "hope"        "well"       
##  [76] "than"        "ladies"      "sisters"     "time"        "netherfield"
##  [81] "manner"      "friend"      "can"         "aunt"        "kitty"      
##  [86] "charlotte"   "feelings"    "daughter"    "before"      "however"    
##  [91] "opinion"     "colonel"     "lucas"       "whom"        "marriage"   
##  [96] "always"      "might"       "nothing"     "indeed"      "happiness"

Keywords LL

Keywords analysis

  • log likelihood used to capture the differences
  • implementation: AntConc, WordSmith Tools, etc.
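As a rough illustration of the log-likelihood idea, the sketch below uses the two-term G² form that is common in corpus linguistics. Whether AntConc or WordSmith Tools compute exactly this variant is not stated on the slide, so treat the formula choice and the function name `log_likelihood` as assumptions.

```python
import math

def log_likelihood(count_a, count_b, size_a, size_b):
    """Two-term log-likelihood (G2) for one word in corpora A and B.

    count_a, count_b: occurrences of the word in A and in B
    size_a, size_b:   total number of tokens in A and in B
    """
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # by convention, 0 * log(0) contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2 * g2

same_rate = log_likelihood(10, 20, 1000, 2000)   # identical relative frequency
overused = log_likelihood(30, 10, 1000, 1000)    # three times more frequent in A
```

The score is unsigned, so the direction of the difference has to be read off the relative frequencies separately.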

Keywords analysis

##   [1] "elizabeth"     "darcy"         "bennet"        "jane"         
##   [5] "bingley"       "of"            "wickham"       "collins"      
##   [9] "very"          "mr"            "lydia"         "which"        
##  [13] "such"          "sister"        "was"           "their"        
##  [17] "lizzy"         "much"          "am"            "longbourn"    
##  [21] "be"            "all"           "gardiner"      "had"          
##  [25] "been"          "most"          "ladies"        "netherfield"  
##  [29] "they"          "kitty"         "her"           "charlotte"    
##  [33] "colonel"       "lucas"         "mrs"           "sisters"      
##  [37] "dear"          "family"        "not"           "herself"      
##  [41] "meryton"       "opinion"       "marriage"      "pemberley"    
##  [45] "letter"        "soon"          "daughters"     "aunt"         
##  [49] "bingley^s"     "rosings"       "town"          "could"        
##  [53] "darcy^s"       "lady"          "party"         "hertfordshire"
##  [57] "daughter"      "them"          "de"            "william"      
##  [61] "girls"         "miss"          "elizabeth^s"   "forster"      
##  [65] "lydia^s"       "happiness"     "without"       "many"         
##  [69] "bourgh"        "mother"        "so"            "every"        
##  [73] "but"           "gentlemen"     "general"       "known"        
##  [77] "feelings"      "she"           "therefore"     "sister^s"     
##  [81] "london"        "hurst"         "wickham^s"     "officers"     
##  [85] "fitzwilliam"   "were"          "whom"          "manner"       
##  [89] "ladyship"      "bennet^s"      "civility"      "friend"       
##  [93] "ball"          "mary"          "jane^s"        "hope"         
##  [97] "certainly"     "will"          "any"           "that"

Zeta

Zeta: the background

Zeta: the solution

\[\zeta_{(a,b)} = \left(\frac{f_a - f_b}{100}\right) +1\]

(Burrows, 2007; Craig, 2009)
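The slide does not define \(f_a\) and \(f_b\). The sketch below assumes, following the usual description of Craig's Zeta, that they are the percentages of equal-sized text segments from A and from B in which the word occurs at least once. The segment size and the helper names are hypothetical.

```python
def segment(tokens, size):
    """Split tokens into consecutive chunks of `size`; a shorter tail is dropped."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def segment_percentage(word, segments):
    """Percentage of segments that contain the word at least once."""
    hits = sum(1 for seg in segments if word in seg)
    return 100 * hits / len(segments)

def zeta(word, segments_a, segments_b):
    """Zeta as given in the formula above: ((f_a - f_b) / 100) + 1."""
    f_a = segment_percentage(word, segments_a)
    f_b = segment_percentage(word, segments_b)
    return (f_a - f_b) / 100 + 1

segments_a = segment("a b a c a d".split(), 2)
segments_b = segment("b c d e f g".split(), 2)
score_a = zeta("a", segments_a, segments_b)  # in every A segment, in no B segment
score_b = zeta("b", segments_a, segments_b)  # equally spread in A and B
```

Under this reading the score runs from 0 (the word occurs in every B segment and no A segment) to 2 (the reverse), with 1 meaning no preference. Each text must yield at least one full segment, otherwise the percentage is undefined.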

Zeta: outcomes

Zeta’s words

##   [1] "elizabeth"     "jane"          "darcy"         "bennet"       
##   [5] "longbourn"     "opinion"       "wickham"       "bingley"      
##   [9] "lizzy"         "town"          "sister"        "lydia"        
##  [13] "ladies"        "netherfield"   "sisters"       "meryton"      
##  [17] "general"       "therefore"     "darcy^s"       "marriage"     
##  [21] "party"         "london"        "certainly"     "bingley^s"    
##  [25] "hertfordshire" "aunt"          "carriage"      "known"        
##  [29] "settled"       "others"        "collins"       "kitty"        
##  [33] "pemberley"     "daughters"     "elizabeth^s"   "jane^s"       
##  [37] "sister^s"      "engaged"       "perfectly"     "feelings"     
##  [41] "lucas"         "bennet^s"      "importance"    "credit"       
##  [45] "manners"       "happiness"     "able"          "behaviour"    
##  [49] "wickham^s"     "girls"         "most"          "civility"     
##  [53] "sensible"      "satisfied"     "mentioned"     "acquaintance" 
##  [57] "daughter"      "affection"     "attention"     "whom"         
##  [61] "seeing"        "charlotte"     "rosings"       "attentions"   
##  [65] "depend"        "excellent"     "believed"      "assured"      
##  [69] "particularly"  "necessary"     "regard"        "likely"       
##  [73] "conversation"  "colonel"       "many"          "added"        
##  [77] "praise"        "understanding" "mary"          "highly"       
##  [81] "anyone"        "address"       "concern"       "fortune"      
##  [85] "assure"        "agreeable"     "honour"        "consequence"  
##  [89] "given"         "possible"      "mother"        "gardiner"     
##  [93] "de"            "lydia^s"       "bourgh"        "gentlemen"    
##  [97] "pleasing"      "eldest"        "particulars"   "subject"

TF–IDF

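  • Term frequency × inverse document frequency
  • IDF: a weighting factor that boosts important frequencies
  • “important” meaning words that are frequent and yet appear in few texts

\[ TFIDF_{w} = f_{w} \times \log \frac{N}{n_{w}} \]

where \(f_{w}\) is the frequency of the word \(w\) in a given text, \(N\) is the number of texts, and \(n_{w}\) is the number of texts containing \(w\).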
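A minimal sketch of the formula, assuming that the term frequency is relative (count divided by text length), that the logarithm is natural, and that the target text is itself part of the collection. The function name `tf_idf` is made up for this example.

```python
import math
from collections import Counter

def tf_idf(target_tokens, corpus):
    """Score every word of one text against a collection of texts.

    corpus: list of token lists; it must include the target text, so that
            every scored word occurs in at least one text.
    """
    n_texts = len(corpus)
    # Document frequency: in how many texts does each word occur?
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    counts = Counter(target_tokens)
    total = len(target_tokens)
    return {w: (c / total) * math.log(n_texts / doc_freq[w])
            for w, c in counts.items()}

corpus = [["darcy", "the", "the"], ["the", "moor"], ["the", "sea"]]
scores = tf_idf(corpus[0], corpus)
```

A word that occurs in every text, such as "the" here, gets a score of zero no matter how frequent it is, which is why the TF–IDF list contains almost no function words.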

TF–IDF’s words

##   [1] "darcy"         "bennet"        "bingley"       "elizabeth"    
##   [5] "wickham"       "collins"       "lydia"         "longbourn"    
##   [9] "gardiner"      "netherfield"   "lucas"         "lizzy"        
##  [13] "meryton"       "pemberley"     "bingley^s"     "rosings"      
##  [17] "kitty"         "darcy^s"       "hertfordshire" "elizabeth^s"  
##  [21] "forster"       "lydia^s"       "bourgh"        "wickham^s"    
##  [25] "fitzwilliam"   "jane"          "catherine"     "bennet^s"     
##  [29] "phillips"      "hunsford"      "brighton"      "collins^s"    
##  [33] "derbyshire"    "hurst"         "charlotte"     "colonel"      
##  [37] "officers"      "william"       "kent"          "lucases"      
##  [41] "gardiner^s"    "denny"         "charlotte^s"   "ladyship"     
##  [45] "judgement"     "catherine^s"   "de"            "jane^s"       
##  [49] "bennets"       "lambton"       "reynolds"      "maria"        
##  [53] "gracechurch"   "etc"           "anyone"        "ladyship^s"   
##  [57] "eliza"         "jenkinson"     "parsonage"     "elopement"    
##  [61] "behaviour"     "georgiana"     "gardiners"     "entail"       
##  [65] "caroline"      "regiment"      "bingleys"      "collinses"    
##  [69] "enumerating"   "fitzwilliam^s" "lucas^s"       "phillips^s"   
##  [73] "cousin"        "saturday"      "shire"         "militia"      
##  [77] "corps"         "james^s"       "lakes"         "bourgh^s"     
##  [81] "clapham"       "fishing"       "newcastle"     "younge"       
##  [85] "ladies"        "niece"         "exposing"      "felicity"     
##  [89] "imprudence"    "uncommonly"    "assembly"      "wednesday"    
##  [93] "nieces"        "engagements"   "louisa"        "twelvemonth"  
##  [97] "gratifying"    "william^s"     "annesley"      "caroline^s"

A new method

An ideal measure

  1. will automatically extract meaningful words for texts A and B;
    • by “automatically” we mean that the technique will be unsupervised
    • by “meaningful” we mean that an extracted keyword carries much semantic information about a given subset of texts
  2. will be able to tell whether the texts being compared differ substantially or only marginally;
  3. will not require any computationally intensive algorithms;
  4. will be conceptually simple;
  5. will have an intuitive linguistic interpretation;

An ideal measure (cont.)

  6. will not make any statistical assumptions, i.e. will be non-parametric;
  7. will allow for significance testing of the identified keywords;
  8. will not require any hyperparameters, or at least any arbitrarily chosen hyperparameters;
  9. will correct for frequent words.

A new measure

  • Given the above constraints, we propose a simple measure of keyness that relies on:
    1. words’ frequencies,
    2. words’ positions on a frequency list.
  • Since (2) is directly derivable from (1), the only thing that the measure requires is two lists of word frequencies.
  • Mathematically, then, the only assumption we formulate is that there exist two texts of finite length \(\lambda \in \mathbb{N}\).

Shakespeare vs. Conan Doyle

Shakespeare             Conan Doyle
word         count      word         count
the          27595      the          28662
and          26735      and          14109
I            22538      of           13229
…                …      …                …
northerly        1      revolvers        1

Frequency and position

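\[P_{r(x)} = \frac{1}{N}\sum_{i=1}^{r(x)}f(x_{i})\]

where \(r(x)\) is the rank of the word \(x\) on the frequency list, \(f(x_{i})\) is the frequency of the word at rank \(i\), and \(N\) is the total number of word tokens. Dividing by \(N\) also guarantees that the values are scaled into the range \([0, 1]\).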

Frequency and position (cont.)

A simpler and more elegant version of the above excludes the frequency of the word itself:

\[P_{r(x)} = \frac{1}{N}\left( \sum_{i=1}^{r(x)}f(x_{i}) - f(x) \right)\]

which is of course equal to:

\[P_{r(x)} = \frac{1}{N}\sum_{i=1}^{r(x)-1}f(x_{i})\]
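The last formula amounts to an exclusive cumulative sum over the frequency list. A minimal sketch, assuming that \(f\) is a raw count and \(N\) is the total number of tokens; the function name `positions` is hypothetical, and words with equal counts are simply kept in their input order, which the slides do not specify.

```python
from itertools import accumulate

def positions(counts):
    """Map each word to the share of tokens taken by all higher-ranked words."""
    # Rank words from most to least frequent (stable for ties).
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    freqs = [n for _, n in ranked]
    # Exclusive cumulative sum: 0, f1, f1 + f2, ...
    preceding = [0] + list(accumulate(freqs))[:-1]
    return {word: cum / total for (word, _), cum in zip(ranked, preceding)}

pos = positions({"the": 5, "and": 3, "moor": 2})
```

Here the most frequent word sits at position 0, and "moor" sits at 0.8 because the two words ranked above it account for 8 of the 10 tokens.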

The cracovian

Having normalized the frequencies/positions of the words, one can easily compute the distance between the positions of the same word in two sets (texts or subcorpora), say A and B, by simply taking their difference. This can be defined as

\[\delta_{x,AB} = \| P_{r(x),A} - P_{r(x),B} \|\]

Cumulative sums

The above geometric interpretation can also be expressed algebraically. In this interpretation, the position \(P_{r(x)}\) of a given word is simply the cumulative sum of the frequencies of all the preceding words.
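Putting the cumulative sums and the difference \(\delta_{x,AB}\) together gives the sketch below. It is an illustration under stated assumptions, not the reference implementation: raw counts are normalized by the total number of tokens, and only words present in both texts are compared, since the slides do not say how a word missing from one text should be placed. Both function names are made up.

```python
def positions(counts):
    """Cumulative share of tokens held by the words ranked above each word."""
    total = sum(counts.values())
    running, result = 0, {}
    for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        result[word] = running / total
        running += n
    return result

def position_differences(counts_a, counts_b):
    """Absolute difference of positions for words shared by A and B."""
    pos_a, pos_b = positions(counts_a), positions(counts_b)
    shared = set(pos_a) & set(pos_b)  # assumption: ignore unshared words
    delta = {w: abs(pos_a[w] - pos_b[w]) for w in shared}
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)

ranking = position_differences({"the": 5, "darcy": 3, "moor": 2},
                               {"the": 6, "moor": 3, "darcy": 1})
```

In this toy example "the" leads both lists and therefore gets a difference of zero, which illustrates how the measure corrects for frequent words.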

The words

##   [1] "elizabeth"     "darcy"         "bennet"        "jane"         
##   [5] "bingley"       "wickham"       "collins"       "lydia"        
##   [9] "lizzy"         "longbourn"     "gardiner"      "ladies"       
##  [13] "netherfield"   "kitty"         "charlotte"     "colonel"      
##  [17] "lucas"         "sisters"       "meryton"       "pemberley"    
##  [21] "daughters"     "bingley^s"     "rosings"       "opinion"      
##  [25] "darcy^s"       "marriage"      "hertfordshire" "de"           
##  [29] "william"       "elizabeth^s"   "forster"       "lydia^s"      
##  [33] "girls"         "bourgh"        "town"          "sister^s"     
##  [37] "gentlemen"     "sister"        "hurst"         "wickham^s"    
##  [41] "fitzwilliam"   "officers"      "ladyship"      "bennet^s"     
##  [45] "party"         "jane^s"        "attentions"    "phillips"     
##  [49] "ball"          "hunsford"      "brighton"      "collins^s"    
##  [53] "derbyshire"    "mary"          "aunt"          "importance"   
##  [57] "depend"        "eliza"         "pleasing"      "eldest"       
##  [61] "general"       "match"         "particulars"   "understanding"
##  [65] "civility"      "compliment"    "praise"        "niece"        
##  [69] "regiment"      "concerned"     "connections"   "felicity"     
##  [73] "maria"         "charlotte^s"   "known"         "parsonage"    
##  [77] "saturday"      "street"        "highly"        "dearest"      
##  [81] "daughter"      "london"        "scheme"        "favour"       
##  [85] "happiness"     "dancing"       "fortune"       "engagement"   
##  [89] "prevailed"     "resentment"    "caroline"      "relations"    
##  [93] "partner"       "kent"          "pounds"        "lucases"      
##  [97] "occurred"      "excellent"     "etc"           "gardiner^s"

Similar yet not identical

##        freqs         zeta            tfidf           cracovian      
##   [1,] "of"          "elizabeth"     "darcy"         "elizabeth"    
##   [2,] "was"         "jane"          "bennet"        "darcy"        
##   [3,] "her"         "darcy"         "bingley"       "bennet"       
##   [4,] "elizabeth"   "bennet"        "elizabeth"     "jane"         
##   [5,] "to"          "longbourn"     "wickham"       "bingley"      
##   [6,] "be"          "opinion"       "collins"       "wickham"      
##   [7,] "mr"          "wickham"       "lydia"         "collins"      
##   [8,] "had"         "bingley"       "longbourn"     "lydia"        
##   [9,] "not"         "lizzy"         "gardiner"      "lizzy"        
##  [10,] "darcy"       "town"          "netherfield"   "longbourn"    
##  [11,] "she"         "sister"        "lucas"         "gardiner"     
##  [12,] "very"        "lydia"         "lizzy"         "ladies"       
##  [13,] "which"       "ladies"        "meryton"       "netherfield"  
##  [14,] "in"          "netherfield"   "pemberley"     "kitty"        
##  [15,] "all"         "sisters"       "bingley^s"     "charlotte"    
##  [16,] "that"        "meryton"       "rosings"       "colonel"      
##  [17,] "they"        "general"       "kitty"         "lucas"        
##  [18,] "bennet"      "therefore"     "darcy^s"       "sisters"      
##  [19,] "their"       "darcy^s"       "hertfordshire" "meryton"      
##  [20,] "but"         "marriage"      "elizabeth^s"   "pemberley"    
##  [21,] "been"        "party"         "forster"       "daughters"    
##  [22,] "such"        "london"        "lydia^s"       "bingley^s"    
##  [23,] "jane"        "certainly"     "bourgh"        "rosings"      
##  [24,] "bingley"     "bingley^s"     "wickham^s"     "opinion"      
##  [25,] "could"       "hertfordshire" "fitzwilliam"   "darcy^s"      
##  [26,] "much"        "aunt"          "jane"          "marriage"     
##  [27,] "am"          "carriage"      "catherine"     "hertfordshire"
##  [28,] "so"          "known"         "bennet^s"      "de"           
##  [29,] "with"        "settled"       "phillips"      "william"      
##  [30,] "mrs"         "others"        "hunsford"      "elizabeth^s"  
##  [31,] "as"          "collins"       "brighton"      "forster"      
##  [32,] "were"        "kitty"         "collins^s"     "lydia^s"      
##  [33,] "them"        "pemberley"     "derbyshire"    "girls"        
##  [34,] "have"        "daughters"     "hurst"         "bourgh"       
##  [35,] "for"         "elizabeth^s"   "charlotte"     "town"         
##  [36,] "is"          "jane^s"        "colonel"       "sister^s"     
##  [37,] "by"          "sister^s"      "officers"      "gentlemen"    
##  [38,] "sister"      "engaged"       "william"       "sister"       
##  [39,] "it"          "perfectly"     "kent"          "hurst"        
##  [40,] "wickham"     "feelings"      "lucases"       "wickham^s"    
##  [41,] "what"        "lucas"         "gardiner^s"    "fitzwilliam"  
##  [42,] "will"        "bennet^s"      "denny"         "officers"     
##  [43,] "collins"     "importance"    "charlotte^s"   "ladyship"     
##  [44,] "herself"     "credit"        "ladyship"      "bennet^s"     
##  [45,] "most"        "manners"       "judgement"     "party"        
##  [46,] "miss"        "happiness"     "catherine^s"   "jane^s"       
##  [47,] "soon"        "able"          "de"            "attentions"   
##  [48,] "this"        "behaviour"     "jane^s"        "phillips"     
##  [49,] "lydia"       "wickham^s"     "bennets"       "ball"         
##  [50,] "who"         "girls"         "lambton"       "hunsford"     
##  [51,] "any"         "most"          "reynolds"      "brighton"     
##  [52,] "dear"        "civility"      "maria"         "collins^s"    
##  [53,] "family"      "sensible"      "gracechurch"   "derbyshire"   
##  [54,] "lady"        "satisfied"     "etc"           "mary"         
##  [55,] "every"       "mentioned"     "anyone"        "aunt"         
##  [56,] "must"        "acquaintance"  "ladyship^s"    "importance"   
##  [57,] "know"        "daughter"      "eliza"         "depend"       
##  [58,] "without"     "affection"     "jenkinson"     "eliza"        
##  [59,] "though"      "attention"     "parsonage"     "pleasing"     
##  [60,] "more"        "whom"          "elopement"     "eldest"       
##  [61,] "letter"      "seeing"        "behaviour"     "general"      
##  [62,] "lizzy"       "charlotte"     "georgiana"     "match"        
##  [63,] "other"       "rosings"       "gardiners"     "particulars"  
##  [64,] "think"       "attentions"    "entail"        "understanding"
##  [65,] "longbourn"   "depend"        "caroline"      "civility"     
##  [66,] "many"        "excellent"     "regiment"      "compliment"   
##  [67,] "good"        "believed"      "bingleys"      "praise"       
##  [68,] "gardiner"    "assured"       "collinses"     "niece"        
##  [69,] "mother"      "particularly"  "enumerating"   "regiment"     
##  [70,] "do"          "necessary"     "fitzwilliam^s" "concerned"    
##  [71,] "no"          "regard"        "lucas^s"       "connections"  
##  [72,] "may"         "likely"        "phillips^s"    "felicity"     
##  [73,] "great"       "conversation"  "cousin"        "maria"        
##  [74,] "hope"        "colonel"       "saturday"      "charlotte^s"  
##  [75,] "well"        "many"          "shire"         "known"        
##  [76,] "than"        "added"         "militia"       "parsonage"    
##  [77,] "ladies"      "praise"        "corps"         "saturday"     
##  [78,] "sisters"     "understanding" "james^s"       "street"       
##  [79,] "time"        "mary"          "lakes"         "highly"       
##  [80,] "netherfield" "highly"        "bourgh^s"      "dearest"      
##  [81,] "manner"      "anyone"        "clapham"       "daughter"     
##  [82,] "friend"      "address"       "fishing"       "london"       
##  [83,] "can"         "concern"       "newcastle"     "scheme"       
##  [84,] "aunt"        "fortune"       "younge"        "favour"       
##  [85,] "kitty"       "assure"        "ladies"        "happiness"    
##  [86,] "charlotte"   "agreeable"     "niece"         "dancing"      
##  [87,] "feelings"    "honour"        "exposing"      "fortune"      
##  [88,] "daughter"    "consequence"   "felicity"      "engagement"   
##  [89,] "before"      "given"         "imprudence"    "prevailed"    
##  [90,] "however"     "possible"      "uncommonly"    "resentment"   
##  [91,] "opinion"     "mother"        "assembly"      "caroline"     
##  [92,] "colonel"     "gardiner"      "wednesday"     "relations"    
##  [93,] "lucas"       "de"            "nieces"        "partner"      
##  [94,] "whom"        "lydia^s"       "engagements"   "kent"         
##  [95,] "marriage"    "bourgh"        "louisa"        "pounds"       
##  [96,] "always"      "gentlemen"     "twelvemonth"   "lucases"      
##  [97,] "might"       "pleasing"      "gratifying"    "occurred"     
##  [98,] "nothing"     "eldest"        "william^s"     "excellent"    
##  [99,] "indeed"      "particulars"   "annesley"      "etc"          
## [100,] "happiness"   "subject"       "caroline^s"    "gardiner^s"
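One simple way to quantify "similar yet not identical" is the overlap between the top-k words of two rankings. The sketch below applies it to the first ten rows of the zeta and cracovian columns above. The choice of k = 10 and of the Jaccard index is an assumption of this example, not part of the talk.

```python
def top_k_overlap(list_a, list_b, k):
    """Return the shared top-k words and the Jaccard index of the two top-k sets."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    shared = top_a & top_b
    jaccard = len(shared) / len(top_a | top_b)
    return shared, jaccard

# First ten rows of the "zeta" and "cracovian" columns of the table above.
zeta_top = ["elizabeth", "jane", "darcy", "bennet", "longbourn",
            "opinion", "wickham", "bingley", "lizzy", "town"]
cracovian_top = ["elizabeth", "darcy", "bennet", "jane", "bingley",
                 "wickham", "collins", "lydia", "lizzy", "longbourn"]

shared, jaccard = top_k_overlap(zeta_top, cracovian_top, 10)
```

The two lists share eight of their ten top words, which gives a Jaccard index of 8/12.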

Comparison

Gender keywords (19th century)

  • 12 novels by female authors
    • Anne Bronte, Jane Austen, Charlotte Bronte, Emily Bronte, George Eliot
  • 12 novels by male authors
    • Charles Dickens, Henry Fielding, Samuel Richardson, William Thackeray, Anthony Trollope
  • A comparison of keywords returned by:
    • LL keywords analysis method
    • Cracovian
    • Zeta

LL keywords

  • female (?)
##  [1] "maggie"     "adam"       "lydgate"    "dorothea"   "emma"      
##  [6] "lucy"       "hetty"      "elinor"     "tulliver"   "jane"      
## [11] "elizabeth"  "casaubon"   "bulstrode"  "rosamond"   "marianne"  
## [16] "weston"     "fred"       "dinah"      "catherine"  "heathcliff"
## [21] "was"        "darcy"      "tom"        "poyser"     "harriet"   
## [26] "philip"     "felt"       "seemed"     "knightley"  "brooke"    
## [31] "its"        "fairfax"    "linton"     "glegg"      "elton"
  • male (?)
##  [1] "upon"      "lady"      "thou"      "lovelace"  "jones"     "phineas"  
##  [7] "my"        "pen"       "lord"      "dear"      "sir"       "major"    
## [13] "letter"    "howe"      "laura"     "harlowe"   "honour"    "madam"    
## [19] "duke"      "george"    "that"      "belford"   "crawley"   "pendennis"
## [25] "sophia"    "clarissa"  "lopez"     "micawber"  "pamela"    "b"        
## [31] "finn"      "so"        "peggotty"  "your"      "captain"

Cracovian

  • female (?)
##  [1] "maggie"      "lydgate"     "adam"        "dorothea"    "hetty"      
##  [6] "lucy"        "elinor"      "tulliver"    "casaubon"    "rosamond"   
## [11] "bulstrode"   "marianne"    "weston"      "emma"        "dinah"      
## [16] "elizabeth"   "heathcliff"  "darcy"       "poyser"      "catherine"  
## [21] "fred"        "knightley"   "brooke"      "philip"      "fairfax"    
## [26] "glegg"       "elton"       "linton"      "graham"      "bennet"     
## [31] "seth"        "garth"       "harriet"     "middlemarch" "ladislaw"
  • male (?)
##  [1] "lovelace"    "phineas"     "jones"       "major"       "howe"       
##  [6] "harlowe"     "belford"     "crawley"     "pendennis"   "clarissa"   
## [11] "lopez"       "micawber"    "sophia"      "pamela"      "laura"      
## [16] "finn"        "peggotty"    "duke"        "allworthy"   "thou"       
## [21] "amelia"      "clavering"   "osborne"     "copperfield" "pen"        
## [26] "wharton"     "traddles"    "bounderby"   "dobbin"      "rawdon"     
## [31] "solmes"      "b"           "jarndyce"    "dora"        "eleanor"

Zeta

  • women are from Venus…
##  [1] "felt"      "feelings"  "feel"      "feeling"   "tone"      "seemed"   
##  [7] "sense"     "rose"      "chapter"   "quiet"     "oh"        "minutes"  
## [13] "cold"      "glance"    "yes"       "deep"      "entered"   "silence"  
## [19] "looked"    "added"     "its"       "smile"     "turned"    "something"
## [25] "seated"    "scarcely"  "speaking"  "suddenly"  "longer"    "strong"   
## [31] "i^ve"      "paused"    "towards"   "continued"
  • …and men are from Mars
##  [1] "honour"    "lord"      "favour"    "although"  "friend"    "lady"     
##  [7] "dear"      "letter"    "upon"      "dearest"   "honest"    "gentleman"
## [13] "lovelace"  "occasion"  "sir"       "sex"       "letters"   "london"   
## [19] "harlowe"   "worthy"    "whom"      "story"     "therefore" "pen"      
## [25] "thou"      "madam"     "creature"  "tis"       "wretch"    "beloved"  
## [31] "clarissa"  "howe"      "written"   "write"

Conclusions

  • Keywords are all the words ordered from the most to the least relevant.
  • For a given text, the extracted keywords depend on the method used.
  • No cut-off point between keywords and “ordinary” words.
  • The new method (the Cracovian) is competitive with, yet simpler than, the time-proven solutions.
  • In real-life applications, Zeta seems to outperform other techniques…
  • …but systematic evaluation is needed to scrutinize this observation.

Thank you!

Appendix

Intersection of the two sets

##   [1] "sisters"       "opinion"       "marriage"      "girls"        
##   [5] "town"          "sister^s"      "sister"        "party"        
##   [9] "ball"          "mary"          "aunt"          "general"      
##  [13] "match"         "understanding" "civility"      "compliment"   
##  [17] "praise"        "known"         "street"        "highly"       
##  [21] "daughter"      "london"        "scheme"        "favour"       
##  [25] "happiness"     "dancing"       "fortune"       "engagement"   
##  [29] "resentment"    "relations"     "partner"       "pounds"       
##  [33] "excellent"     "evident"       "invitation"    "letter"       
##  [37] "wholly"        "anyone"        "settled"       "sensible"     
##  [41] "women"         "extraordinary" "therefore"     "agreeable"    
##  [45] "madam"         "tuesday"       "possibility"   "fortunate"    
##  [49] "credit"        "endeavour"     "dear"          "most"         
##  [53] "success"       "concern"       "others"        "family"       
##  [57] "grateful"      "assure"        "belief"        "engaged"      
##  [61] "carriage"      "gratitude"     "terms"         "honour"       
##  [65] "virtue"        "enjoyment"     "advantage"     "unlucky"      
##  [69] "feelings"      "acknowledged"  "tolerable"     "inquiries"    
##  [73] "such"          "motive"        "objections"    "likewise"     
##  [77] "expectation"   "noble"         "moments"       "attended"     
##  [81] "whom"          "address"       "manners"       "satisfied"    
##  [85] "degree"        "valuable"      "remarkably"    "believed"     
##  [89] "assured"       "very"          "uncle^s"       "respect"      
##  [93] "thousand"      "situation"     "overcome"      "drawing-room" 
##  [97] "visitors"      "design"        "receiving"     "affair"

Jane Austen, Pride and Prejudice

Words for the other set

##   [1] "i^m"       "don^t"     "you^ll"    "won^t"     "kitchen"   "master"   
##   [7] "lad"       "papa"      "bid"       "can^t"     "black"     "fire"     
##  [13] "bed"       "chamber"   "lie"       "exclaimed" "whispered" "there^s"  
##  [19] "o"         "cry"       "hands"     "hair"      "fell"      "hearth"   
##  [25] "god"       "arm"       "soul"      "boy"       "dark"      "old"      
##  [31] "remarked"  "feet"      "answered"  "wind"      "fingers"   "dead"     
##  [37] "earth"     "lay"       "chair"     "mouth"     "rough"     "tongue"   
##  [43] "corner"    "seized"    "white"     "fit"       "death"     "mistress" 
##  [49] "teeth"     "shut"      "crying"    "hold"      "door"      "die"      
##  [55] "sleep"     "held"      "water"     "grew"      "horse"     "hurt"     
##  [61] "afternoon" "darling"   "th"        "breath"    "hand"      "face"     
##  [67] "gaze"      "show"      "presently" "round"     "rose"      "catherine"
##  [73] "watch"     "to-night"  "sprang"    "mine"      "kill"      "presence" 
##  [79] "hate"      "frame"     "child"     "departed"  "minute"    "endure"   
##  [85] "ear"       "bit"       "stepped"   "top"       "bitter"    "passion"  
##  [91] "master^s"  "seat"      "window"    "hat"       "died"      "sit"      
##  [97] "settle"    "ah"        "got"       "lock"

Emily Bronte, Wuthering Heights