Maciej Eder (maciej.eder@ijp.pan.pl)
Extracting keywords and topics from text collections
Institute of Polish Language (Polish Academy of Sciences)
Verona, 19.11.2024
The meaning of words lies in their use.
(Wittgenstein 1953: 80, 109)
You shall know a word by the company it keeps.
(Firth 1962: 11)
I have just returned from a visit to my landlord – the solitary neighbour that I shall be troubled with.
neighbour solitary troubled landlord visit returned just shall from I have be with my that a to the
heathcliff, linton, catherine, hareton, earnshaw, cathy, edgar, ellen, heights, hindley, nelly, ll, grange, i, wuthering, t, joseph, isabella, master, gimmerton, zillah, m, exclaimed, he, thrushcross, and, answered, yah, kenneth, ve, maister, lockwood, kitchen, you, dean, moors, replied, cried, him, muttered, lintons, papa, she, till, commenced, on, wer, ech, shoo, leant, hearth, bonny, door, stairs, hell, me, crags, moor, wouldn, fiend, settle, jabez, penistone, fire, ye, its, bid, nowt, naught, yer, hush, mistress, grew, lad, compelled, minny, won, hisseln, skulker, soa, wisht, cousin, lattice, didn, yon, minute, lass, needn, inquired, snow, branderham, flaysome, gooid, sud, thear, affirming, interrupted, couldn, window, …, …, …, a, in, the
the | and | to | i | of | … |
---|---|---|---|---|---|
3.557 | 2.938 | 3.398 | 1.696 | 2.965 | … |
3.919 | 4.068 | 2.988 | 3.078 | 1.905 | … |
The difference between the values in the two rows (first row minus second row):
the | and | to | i | of | … |
---|---|---|---|---|---|
-0.362 | -1.130 | 0.410 | -1.382 | 1.059 | … |
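The subtraction above can be reproduced with a minimal Python sketch. The numbers are copied from the two table rows; the variable names (`text_a`, `text_b`) are only illustrative labels, since the slide does not name the two texts. Because the inputs are already rounded to three decimals, the last difference comes out as 1.060 rather than the 1.059 shown above, presumably because the slide was computed from unrounded values.

```python
# Values copied from the two table rows above (one row per text).
words = ["the", "and", "to", "i", "of"]
text_a = [3.557, 2.938, 3.398, 1.696, 2.965]  # first row (illustrative label)
text_b = [3.919, 4.068, 2.988, 3.078, 1.905]  # second row (illustrative label)

# Word-by-word difference: first row minus second row.
differences = {w: round(a - b, 3) for w, a, b in zip(words, text_a, text_b)}

# Rank the words by the absolute size of the difference, largest gap first.
ranked = sorted(differences, key=lambda w: abs(differences[w]), reverse=True)

print(differences)
print(ranked)
```

Words with the largest absolute difference are the ones whose usage differs most between the two texts.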
## [1] "of" "was" "her" "elizabeth" "to"
## [6] "be" "mr" "had" "not" "darcy"
## [11] "she" "very" "which" "in" "all"
## [16] "that" "they" "bennet" "their" "but"
## [21] "been" "such" "jane" "bingley" "could"
## [26] "much" "am" "so" "with" "mrs"
## [31] "as" "were" "them" "have" "for"
## [36] "is" "by" "sister" "it" "wickham"
## [41] "what" "will" "collins" "herself" "most"
## [46] "miss" "soon" "this" "lydia" "who"
## [51] "any" "dear" "family" "lady" "every"
## [56] "must" "know" "without" "though" "more"
## [61] "letter" "lizzy" "other" "think" "longbourn"
## [66] "many" "good" "gardiner" "mother" "do"
## [71] "no" "may" "great" "hope" "well"
## [76] "than" "ladies" "sisters" "time" "netherfield"
## [81] "manner" "friend" "can" "aunt" "kitty"
## [86] "charlotte" "feelings" "daughter" "before" "however"
## [91] "opinion" "colonel" "lucas" "whom" "marriage"
## [96] "always" "might" "nothing" "indeed" "happiness"
## [1] "elizabeth" "darcy" "bennet" "jane"
## [5] "bingley" "of" "wickham" "collins"
## [9] "very" "mr" "lydia" "which"
## [13] "such" "sister" "was" "their"
## [17] "lizzy" "much" "am" "longbourn"
## [21] "be" "all" "gardiner" "had"
## [25] "been" "most" "ladies" "netherfield"
## [29] "they" "kitty" "her" "charlotte"
## [33] "colonel" "lucas" "mrs" "sisters"
## [37] "dear" "family" "not" "herself"
## [41] "meryton" "opinion" "marriage" "pemberley"
## [45] "letter" "soon" "daughters" "aunt"
## [49] "bingley^s" "rosings" "town" "could"
## [53] "darcy^s" "lady" "party" "hertfordshire"
## [57] "daughter" "them" "de" "william"
## [61] "girls" "miss" "elizabeth^s" "forster"
## [65] "lydia^s" "happiness" "without" "many"
## [69] "bourgh" "mother" "so" "every"
## [73] "but" "gentlemen" "general" "known"
## [77] "feelings" "she" "therefore" "sister^s"
## [81] "london" "hurst" "wickham^s" "officers"
## [85] "fitzwilliam" "were" "whom" "manner"
## [89] "ladyship" "bennet^s" "civility" "friend"
## [93] "ball" "mary" "jane^s" "hope"
## [97] "certainly" "will" "any" "that"
\[\zeta_{(a,b)} = \left(\frac{f_a - f_b}{100}\right) +1\]
(Burrows, 2007; Craig, 2009)
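A minimal sketch of the Zeta formula in Python, under one stated assumption: the slide does not define \(f_a\) and \(f_b\), so they are taken here to be the percentage of text segments in subcorpus A and in subcorpus B that contain the word at least once. The function names and the toy segments are made up for illustration.

```python
def segment_share(word, segments):
    """Percentage (0-100) of segments that contain the word at least once."""
    hits = sum(1 for seg in segments if word in seg)
    return 100.0 * hits / len(segments)


def zeta(word, segments_a, segments_b):
    """Zeta as in the formula above: (f_a - f_b) / 100 + 1, between 0 and 2."""
    f_a = segment_share(word, segments_a)
    f_b = segment_share(word, segments_b)
    return (f_a - f_b) / 100.0 + 1.0


# Toy, made-up segments (each one a set of word types), purely illustrative.
segments_a = [
    {"elizabeth", "darcy", "the"},
    {"elizabeth", "the"},
    {"jane", "the"},
    {"elizabeth", "of", "the"},
]
segments_b = [
    {"heathcliff", "the"},
    {"catherine", "the"},
    {"moor", "the"},
    {"elizabeth", "the"},
]

scores = {w: zeta(w, segments_a, segments_b)
          for w in ["elizabeth", "the", "heathcliff"]}
print(scores)
```

Under this reading, values near 2 mark words preferred in A, values near 0 mark words preferred in B, and a word present in every segment of both sets scores exactly 1.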
## [1] "elizabeth" "jane" "darcy" "bennet"
## [5] "longbourn" "opinion" "wickham" "bingley"
## [9] "lizzy" "town" "sister" "lydia"
## [13] "ladies" "netherfield" "sisters" "meryton"
## [17] "general" "therefore" "darcy^s" "marriage"
## [21] "party" "london" "certainly" "bingley^s"
## [25] "hertfordshire" "aunt" "carriage" "known"
## [29] "settled" "others" "collins" "kitty"
## [33] "pemberley" "daughters" "elizabeth^s" "jane^s"
## [37] "sister^s" "engaged" "perfectly" "feelings"
## [41] "lucas" "bennet^s" "importance" "credit"
## [45] "manners" "happiness" "able" "behaviour"
## [49] "wickham^s" "girls" "most" "civility"
## [53] "sensible" "satisfied" "mentioned" "acquaintance"
## [57] "daughter" "affection" "attention" "whom"
## [61] "seeing" "charlotte" "rosings" "attentions"
## [65] "depend" "excellent" "believed" "assured"
## [69] "particularly" "necessary" "regard" "likely"
## [73] "conversation" "colonel" "many" "added"
## [77] "praise" "understanding" "mary" "highly"
## [81] "anyone" "address" "concern" "fortune"
## [85] "assure" "agreeable" "honour" "consequence"
## [89] "given" "possible" "mother" "gardiner"
## [93] "de" "lydia^s" "bourgh" "gentlemen"
## [97] "pleasing" "eldest" "particulars" "subject"
\[ TFIDF_{w} = f_{w} \times \log \frac{N}{n_{w}} \]
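The formula can be sketched in Python as follows. The slide does not spell out its symbols, so this sketch assumes the usual reading: \(f_{w}\) is the frequency of the word in the text of interest, \(N\) is the number of texts in the collection, and the denominator inside the logarithm is the number of texts that contain the word. The natural logarithm and the toy texts are likewise assumptions.

```python
import math
from collections import Counter


def tfidf(target_tokens, collection):
    """TF-IDF for every word of the target text: frequency * log(N / texts with word).

    The target text is expected to be part of the collection, so each of its
    words occurs in at least one text and the division is always defined.
    """
    n_texts = len(collection)             # N
    counts = Counter(target_tokens)       # f_w (raw counts in this sketch)
    scores = {}
    for word, freq in counts.items():
        texts_with_word = sum(1 for text in collection if word in text)
        scores[word] = freq * math.log(n_texts / texts_with_word)
    return scores


# Toy, made-up collection of three tokenized "texts".
collection = [
    "darcy spoke to elizabeth and to the bennet family".split(),
    "heathcliff walked to the heights".split(),
    "the whale and the sea".split(),
]

scores = tfidf(collection[0], collection)
top = sorted(scores, key=scores.get, reverse=True)
print(top)
```

A word that occurs in every text gets a score of zero, which is why proper names dominate a TF-IDF list while function words drop out.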
## [1] "darcy" "bennet" "bingley" "elizabeth"
## [5] "wickham" "collins" "lydia" "longbourn"
## [9] "gardiner" "netherfield" "lucas" "lizzy"
## [13] "meryton" "pemberley" "bingley^s" "rosings"
## [17] "kitty" "darcy^s" "hertfordshire" "elizabeth^s"
## [21] "forster" "lydia^s" "bourgh" "wickham^s"
## [25] "fitzwilliam" "jane" "catherine" "bennet^s"
## [29] "phillips" "hunsford" "brighton" "collins^s"
## [33] "derbyshire" "hurst" "charlotte" "colonel"
## [37] "officers" "william" "kent" "lucases"
## [41] "gardiner^s" "denny" "charlotte^s" "ladyship"
## [45] "judgement" "catherine^s" "de" "jane^s"
## [49] "bennets" "lambton" "reynolds" "maria"
## [53] "gracechurch" "etc" "anyone" "ladyship^s"
## [57] "eliza" "jenkinson" "parsonage" "elopement"
## [61] "behaviour" "georgiana" "gardiners" "entail"
## [65] "caroline" "regiment" "bingleys" "collinses"
## [69] "enumerating" "fitzwilliam^s" "lucas^s" "phillips^s"
## [73] "cousin" "saturday" "shire" "militia"
## [77] "corps" "james^s" "lakes" "bourgh^s"
## [81] "clapham" "fishing" "newcastle" "younge"
## [85] "ladies" "niece" "exposing" "felicity"
## [89] "imprudence" "uncommonly" "assembly" "wednesday"
## [93] "nieces" "engagements" "louisa" "twelvemonth"
## [97] "gratifying" "william^s" "annesley" "caroline^s"
word | count | word | count |
---|---|---|---|
the | 27595 | the | 28662 |
and | 26735 | and | 14109 |
I | 22538 | of | 13229 |
… | … | … | … |
northerly | 1 | revolvers | 1 |
\[P_{r(x)} = \frac{1}{N}\sum_{i=1}^{r(x)}f(x_{i})\]
which also guarantees that the values are scaled into the range [0, 1].
A simpler and more elegant version of the above:
\[P_{r(x)} = \frac{1}{N}\left( \sum_{i=1}^{r(x)}f(x_{i}) - f(x) \right)\]
which is of course equal to:
\[P_{r(x)} = \frac{1}{N}\sum_{i=1}^{r(x)-1}f(x_{i})\]
Having normalized the frequencies/positions of the words, one can easily compute the distance between the positions of the same word in two sets (texts or subcorpora), say A and B, by simply taking their difference. This can be defined as
\[\delta_{x,AB} = \| P_{r(x),A} - P_{r(x),B} \|\]
The above geometric interpretation can also be expressed algebraically. In this interpretation, the position \(P_{r(x)}\) of a given word is simply the cumulative sum of the frequencies of all the preceding words.
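The algebraic reading can be sketched in a few lines of Python: rank the words by frequency, give each word the cumulative relative frequency of all words ranked before it, and take the absolute difference of a word's positions in two sets. The function names, the alphabetical tie-breaking, and the toy token lists are assumptions made for this sketch; words absent from one of the sets are not handled, because the slides do not say how they should be treated.

```python
from collections import Counter


def positions(tokens):
    """Map each word to the cumulative relative frequency of the words ranked before it."""
    counts = Counter(tokens)
    total = sum(counts.values())  # N: total number of tokens
    pos, running = {}, 0
    # Most frequent first; ties are broken alphabetically (an assumption).
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        pos[word] = running / total  # sum of preceding frequencies, scaled by N
        running += freq
    return pos


def delta(word, pos_a, pos_b):
    """Absolute difference between a word's positions in sets A and B."""
    return abs(pos_a[word] - pos_b[word])


# Toy, made-up token lists standing in for two texts or subcorpora.
tokens_a = "the the the and and darcy".split()
tokens_b = "the the darcy darcy darcy and".split()

pos_a, pos_b = positions(tokens_a), positions(tokens_b)
deltas = {w: delta(w, pos_a, pos_b) for w in ["the", "and", "darcy"]}
print(deltas)
```

Because every position is a share of the total token count, the most frequent word always sits at 0 and all positions stay below 1, so the differences are directly comparable across sets of different sizes.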
## [1] "elizabeth" "darcy" "bennet" "jane"
## [5] "bingley" "wickham" "collins" "lydia"
## [9] "lizzy" "longbourn" "gardiner" "ladies"
## [13] "netherfield" "kitty" "charlotte" "colonel"
## [17] "lucas" "sisters" "meryton" "pemberley"
## [21] "daughters" "bingley^s" "rosings" "opinion"
## [25] "darcy^s" "marriage" "hertfordshire" "de"
## [29] "william" "elizabeth^s" "forster" "lydia^s"
## [33] "girls" "bourgh" "town" "sister^s"
## [37] "gentlemen" "sister" "hurst" "wickham^s"
## [41] "fitzwilliam" "officers" "ladyship" "bennet^s"
## [45] "party" "jane^s" "attentions" "phillips"
## [49] "ball" "hunsford" "brighton" "collins^s"
## [53] "derbyshire" "mary" "aunt" "importance"
## [57] "depend" "eliza" "pleasing" "eldest"
## [61] "general" "match" "particulars" "understanding"
## [65] "civility" "compliment" "praise" "niece"
## [69] "regiment" "concerned" "connections" "felicity"
## [73] "maria" "charlotte^s" "known" "parsonage"
## [77] "saturday" "street" "highly" "dearest"
## [81] "daughter" "london" "scheme" "favour"
## [85] "happiness" "dancing" "fortune" "engagement"
## [89] "prevailed" "resentment" "caroline" "relations"
## [93] "partner" "kent" "pounds" "lucases"
## [97] "occurred" "excellent" "etc" "gardiner^s"
## freqs zeta tfidf cracovian
## [1,] "of" "elizabeth" "darcy" "elizabeth"
## [2,] "was" "jane" "bennet" "darcy"
## [3,] "her" "darcy" "bingley" "bennet"
## [4,] "elizabeth" "bennet" "elizabeth" "jane"
## [5,] "to" "longbourn" "wickham" "bingley"
## [6,] "be" "opinion" "collins" "wickham"
## [7,] "mr" "wickham" "lydia" "collins"
## [8,] "had" "bingley" "longbourn" "lydia"
## [9,] "not" "lizzy" "gardiner" "lizzy"
## [10,] "darcy" "town" "netherfield" "longbourn"
## [11,] "she" "sister" "lucas" "gardiner"
## [12,] "very" "lydia" "lizzy" "ladies"
## [13,] "which" "ladies" "meryton" "netherfield"
## [14,] "in" "netherfield" "pemberley" "kitty"
## [15,] "all" "sisters" "bingley^s" "charlotte"
## [16,] "that" "meryton" "rosings" "colonel"
## [17,] "they" "general" "kitty" "lucas"
## [18,] "bennet" "therefore" "darcy^s" "sisters"
## [19,] "their" "darcy^s" "hertfordshire" "meryton"
## [20,] "but" "marriage" "elizabeth^s" "pemberley"
## [21,] "been" "party" "forster" "daughters"
## [22,] "such" "london" "lydia^s" "bingley^s"
## [23,] "jane" "certainly" "bourgh" "rosings"
## [24,] "bingley" "bingley^s" "wickham^s" "opinion"
## [25,] "could" "hertfordshire" "fitzwilliam" "darcy^s"
## [26,] "much" "aunt" "jane" "marriage"
## [27,] "am" "carriage" "catherine" "hertfordshire"
## [28,] "so" "known" "bennet^s" "de"
## [29,] "with" "settled" "phillips" "william"
## [30,] "mrs" "others" "hunsford" "elizabeth^s"
## [31,] "as" "collins" "brighton" "forster"
## [32,] "were" "kitty" "collins^s" "lydia^s"
## [33,] "them" "pemberley" "derbyshire" "girls"
## [34,] "have" "daughters" "hurst" "bourgh"
## [35,] "for" "elizabeth^s" "charlotte" "town"
## [36,] "is" "jane^s" "colonel" "sister^s"
## [37,] "by" "sister^s" "officers" "gentlemen"
## [38,] "sister" "engaged" "william" "sister"
## [39,] "it" "perfectly" "kent" "hurst"
## [40,] "wickham" "feelings" "lucases" "wickham^s"
## [41,] "what" "lucas" "gardiner^s" "fitzwilliam"
## [42,] "will" "bennet^s" "denny" "officers"
## [43,] "collins" "importance" "charlotte^s" "ladyship"
## [44,] "herself" "credit" "ladyship" "bennet^s"
## [45,] "most" "manners" "judgement" "party"
## [46,] "miss" "happiness" "catherine^s" "jane^s"
## [47,] "soon" "able" "de" "attentions"
## [48,] "this" "behaviour" "jane^s" "phillips"
## [49,] "lydia" "wickham^s" "bennets" "ball"
## [50,] "who" "girls" "lambton" "hunsford"
## [51,] "any" "most" "reynolds" "brighton"
## [52,] "dear" "civility" "maria" "collins^s"
## [53,] "family" "sensible" "gracechurch" "derbyshire"
## [54,] "lady" "satisfied" "etc" "mary"
## [55,] "every" "mentioned" "anyone" "aunt"
## [56,] "must" "acquaintance" "ladyship^s" "importance"
## [57,] "know" "daughter" "eliza" "depend"
## [58,] "without" "affection" "jenkinson" "eliza"
## [59,] "though" "attention" "parsonage" "pleasing"
## [60,] "more" "whom" "elopement" "eldest"
## [61,] "letter" "seeing" "behaviour" "general"
## [62,] "lizzy" "charlotte" "georgiana" "match"
## [63,] "other" "rosings" "gardiners" "particulars"
## [64,] "think" "attentions" "entail" "understanding"
## [65,] "longbourn" "depend" "caroline" "civility"
## [66,] "many" "excellent" "regiment" "compliment"
## [67,] "good" "believed" "bingleys" "praise"
## [68,] "gardiner" "assured" "collinses" "niece"
## [69,] "mother" "particularly" "enumerating" "regiment"
## [70,] "do" "necessary" "fitzwilliam^s" "concerned"
## [71,] "no" "regard" "lucas^s" "connections"
## [72,] "may" "likely" "phillips^s" "felicity"
## [73,] "great" "conversation" "cousin" "maria"
## [74,] "hope" "colonel" "saturday" "charlotte^s"
## [75,] "well" "many" "shire" "known"
## [76,] "than" "added" "militia" "parsonage"
## [77,] "ladies" "praise" "corps" "saturday"
## [78,] "sisters" "understanding" "james^s" "street"
## [79,] "time" "mary" "lakes" "highly"
## [80,] "netherfield" "highly" "bourgh^s" "dearest"
## [81,] "manner" "anyone" "clapham" "daughter"
## [82,] "friend" "address" "fishing" "london"
## [83,] "can" "concern" "newcastle" "scheme"
## [84,] "aunt" "fortune" "younge" "favour"
## [85,] "kitty" "assure" "ladies" "happiness"
## [86,] "charlotte" "agreeable" "niece" "dancing"
## [87,] "feelings" "honour" "exposing" "fortune"
## [88,] "daughter" "consequence" "felicity" "engagement"
## [89,] "before" "given" "imprudence" "prevailed"
## [90,] "however" "possible" "uncommonly" "resentment"
## [91,] "opinion" "mother" "assembly" "caroline"
## [92,] "colonel" "gardiner" "wednesday" "relations"
## [93,] "lucas" "de" "nieces" "partner"
## [94,] "whom" "lydia^s" "engagements" "kent"
## [95,] "marriage" "bourgh" "louisa" "pounds"
## [96,] "always" "gentlemen" "twelvemonth" "lucases"
## [97,] "might" "pleasing" "gratifying" "occurred"
## [98,] "nothing" "eldest" "william^s" "excellent"
## [99,] "indeed" "particulars" "annesley" "etc"
## [100,] "happiness" "subject" "caroline^s" "gardiner^s"
## [1] "maggie" "adam" "lydgate" "dorothea" "emma"
## [6] "lucy" "hetty" "elinor" "tulliver" "jane"
## [11] "elizabeth" "casaubon" "bulstrode" "rosamond" "marianne"
## [16] "weston" "fred" "dinah" "catherine" "heathcliff"
## [21] "was" "darcy" "tom" "poyser" "harriet"
## [26] "philip" "felt" "seemed" "knightley" "brooke"
## [31] "its" "fairfax" "linton" "glegg" "elton"
## [1] "upon" "lady" "thou" "lovelace" "jones" "phineas"
## [7] "my" "pen" "lord" "dear" "sir" "major"
## [13] "letter" "howe" "laura" "harlowe" "honour" "madam"
## [19] "duke" "george" "that" "belford" "crawley" "pendennis"
## [25] "sophia" "clarissa" "lopez" "micawber" "pamela" "b"
## [31] "finn" "so" "peggotty" "your" "captain"
## [1] "maggie" "lydgate" "adam" "dorothea" "hetty"
## [6] "lucy" "elinor" "tulliver" "casaubon" "rosamond"
## [11] "bulstrode" "marianne" "weston" "emma" "dinah"
## [16] "elizabeth" "heathcliff" "darcy" "poyser" "catherine"
## [21] "fred" "knightley" "brooke" "philip" "fairfax"
## [26] "glegg" "elton" "linton" "graham" "bennet"
## [31] "seth" "garth" "harriet" "middlemarch" "ladislaw"
## [1] "lovelace" "phineas" "jones" "major" "howe"
## [6] "harlowe" "belford" "crawley" "pendennis" "clarissa"
## [11] "lopez" "micawber" "sophia" "pamela" "laura"
## [16] "finn" "peggotty" "duke" "allworthy" "thou"
## [21] "amelia" "clavering" "osborne" "copperfield" "pen"
## [26] "wharton" "traddles" "bounderby" "dobbin" "rawdon"
## [31] "solmes" "b" "jarndyce" "dora" "eleanor"
## [1] "felt" "feelings" "feel" "feeling" "tone" "seemed"
## [7] "sense" "rose" "chapter" "quiet" "oh" "minutes"
## [13] "cold" "glance" "yes" "deep" "entered" "silence"
## [19] "looked" "added" "its" "smile" "turned" "something"
## [25] "seated" "scarcely" "speaking" "suddenly" "longer" "strong"
## [31] "i^ve" "paused" "towards" "continued"
## [1] "honour" "lord" "favour" "although" "friend" "lady"
## [7] "dear" "letter" "upon" "dearest" "honest" "gentleman"
## [13] "lovelace" "occasion" "sir" "sex" "letters" "london"
## [19] "harlowe" "worthy" "whom" "story" "therefore" "pen"
## [25] "thou" "madam" "creature" "tis" "wretch" "beloved"
## [31] "clarissa" "howe" "written" "write"
An example: \[ P(A) = 0.001 \quad\quad P(B) = 0.002 \quad\quad P(A) \times P(B) = 0.000002 \]
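The arithmetic can be checked with a short Python sketch. It reads the product as the probability of the two words occurring together if they were independent of each other, which is an interpretation rather than something the slide states; the corpus size is a hypothetical figure chosen only to turn the probability into a count.

```python
import math

p_a = 0.001  # P(A), from the example above
p_b = 0.002  # P(B), from the example above

# Under independence (assumed here), the joint probability is the product.
p_joint = p_a * p_b

# Hypothetical corpus size, used only to express the result as a count.
corpus_size = 1_000_000
expected_cooccurrences = p_joint * corpus_size

print(f"{p_joint:.6f}")
print(round(expected_cooccurrences, 3))
```

A pair that co-occurs far more often than this baseline suggests is a candidate collocation.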
strong tea — *powerful tea
powerful computer — *strong computer
We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability.
(Blei 2012, 78)
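Blei's definition can be made concrete with a small Python sketch: a topic is a mapping from every word of a fixed vocabulary to a probability, and the probabilities sum to one. The vocabulary, the two topics, and all the numbers below are invented for illustration and do not come from any fitted model.

```python
import math
import random

# Two made-up topics over the same tiny fixed vocabulary.
genetics = {"gene": 0.40, "dna": 0.35, "species": 0.05, "evolve": 0.05, "the": 0.15}
evolution = {"gene": 0.05, "dna": 0.05, "species": 0.40, "evolve": 0.35, "the": 0.15}


def top_words(topic, n=2):
    """The n most probable words of a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


def sample_words(topic, k, seed=0):
    """Draw k words from the topic according to its probabilities."""
    rng = random.Random(seed)
    words = list(topic)
    return rng.choices(words, weights=[topic[w] for w in words], k=k)


# A topic is a probability distribution, so its values sum to 1.
assert math.isclose(sum(genetics.values()), 1.0)
assert math.isclose(sum(evolution.values()), 1.0)

print(top_words(genetics))
print(top_words(evolution))
print(sample_words(genetics, k=5))
```

Both topics cover the same vocabulary; they differ only in which words receive high probability, which is what the lists of top words shown for a topic summarize.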
fight soldier arms war soldiers field fly sword horse valiant march battle brave messenger arm army trumpet valour kings camp alarum walls join wars slain tent forces gates drum courage trumpets lion town fought foes english armour city saint guard colours victory herald swords fame armed country wounds plain safe …
Indeed calling these models “topic models” is retrospective – the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed. The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA.
(Blei 2012, 79)
## [1] "sisters" "opinion" "marriage" "girls"
## [5] "town" "sister^s" "sister" "party"
## [9] "ball" "mary" "aunt" "general"
## [13] "match" "understanding" "civility" "compliment"
## [17] "praise" "known" "street" "highly"
## [21] "daughter" "london" "scheme" "favour"
## [25] "happiness" "dancing" "fortune" "engagement"
## [29] "resentment" "relations" "partner" "pounds"
## [33] "excellent" "evident" "invitation" "letter"
## [37] "wholly" "anyone" "settled" "sensible"
## [41] "women" "extraordinary" "therefore" "agreeable"
## [45] "madam" "tuesday" "possibility" "fortunate"
## [49] "credit" "endeavour" "dear" "most"
## [53] "success" "concern" "others" "family"
## [57] "grateful" "assure" "belief" "engaged"
## [61] "carriage" "gratitude" "terms" "honour"
## [65] "virtue" "enjoyment" "advantage" "unlucky"
## [69] "feelings" "acknowledged" "tolerable" "inquiries"
## [73] "such" "motive" "objections" "likewise"
## [77] "expectation" "noble" "moments" "attended"
## [81] "whom" "address" "manners" "satisfied"
## [85] "degree" "valuable" "remarkably" "believed"
## [89] "assured" "very" "uncle^s" "respect"
## [93] "thousand" "situation" "overcome" "drawing-room"
## [97] "visitors" "design" "receiving" "affair"
Jane Austen, Pride and Prejudice
## [1] "i^m" "don^t" "you^ll" "won^t" "kitchen" "master"
## [7] "lad" "papa" "bid" "can^t" "black" "fire"
## [13] "bed" "chamber" "lie" "exclaimed" "whispered" "there^s"
## [19] "o" "cry" "hands" "hair" "fell" "hearth"
## [25] "god" "arm" "soul" "boy" "dark" "old"
## [31] "remarked" "feet" "answered" "wind" "fingers" "dead"
## [37] "earth" "lay" "chair" "mouth" "rough" "tongue"
## [43] "corner" "seized" "white" "fit" "death" "mistress"
## [49] "teeth" "shut" "crying" "hold" "door" "die"
## [55] "sleep" "held" "water" "grew" "horse" "hurt"
## [61] "afternoon" "darling" "th" "breath" "hand" "face"
## [67] "gaze" "show" "presently" "round" "rose" "catherine"
## [73] "watch" "to-night" "sprang" "mine" "kill" "presence"
## [79] "hate" "frame" "child" "departed" "minute" "endure"
## [85] "ear" "bit" "stepped" "top" "bitter" "passion"
## [91] "master^s" "seat" "window" "hat" "died" "sit"
## [97] "settle" "ah" "got" "lock"
Emily Bronte, Wuthering Heights