Extracting keywords and topics from text collections

Maciej Eder (maciej.eder@ijp.pan.pl)

Institute of Polish Language (Polish Academy of Sciences)

Verona, 19.11.2024

Motivation

Meaning defined by the context

The meaning of words lies in their use.

(Wittgenstein 1953: 80, 109)

You shall know a word by the company it keeps.

(Firth 1962: 11)

Distributional semantics

Keywords

Does an ideal keyword exist?

Basic assumptions

A simple idea…

I have just returned from a visit to my landlord – the solitary neighbour that I shall be troubled with.

neighbour solitary troubled landlord visit returned just shall from I have be with my that a to the

Emily Bronte, Wuthering Heights

heathcliff, linton, catherine, hareton, earnshaw, cathy, edgar, ellen, heights, hindley, nelly, ll, grange, i, wuthering, t, joseph, isabella, master, gimmerton, zillah, m, exclaimed, he, thrushcross, and, answered, yah, kenneth, ve, maister, lockwood, kitchen, you, dean, moors, replied, cried, him, muttered, lintons, papa, she, till, commenced, on, wer, ech, shoo, leant, hearth, bonny, door, stairs, hell, me, crags, moor, wouldn, fiend, settle, jabez, penistone, fire, ye, its, bid, nowt, naught, yer, hush, mistress, grew, lad, compelled, minny, won, hisseln, skulker, soa, wisht, cousin, lattice, didn, yon, minute, lass, needn, inquired, snow, branderham, flaysome, gooid, sud, thear, affirming, interrupted, couldn, window, …, …, …, a, in, the

Frequencies?

Frequencies in two texts

Differences between frequencies

         the      and      to       i        of
text 1   3.557    2.938    3.398    1.696    2.965
text 2   3.919    4.068    2.988    3.078    1.905

The difference between the values (text 1 minus text 2):

         the      and      to       i        of
         -0.362   -1.130   0.410    -1.382   1.059
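The subtraction above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the values on the slide are relative frequencies expressed as percentages of all tokens; the tokenizer, the function names and the toy sentences are hypothetical and are not taken from the talk.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return word -> relative frequency, in percent of all tokens."""
    # Assumption: a crude tokenizer that keeps runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 100.0 * count / total for word, count in counts.items()}


def frequency_differences(text_1, text_2):
    """Difference (text 1 minus text 2) for every word seen in either text."""
    f1 = relative_frequencies(text_1)
    f2 = relative_frequencies(text_2)
    vocabulary = set(f1) | set(f2)
    # A word absent from one text counts as frequency 0 there.
    return {w: f1.get(w, 0.0) - f2.get(w, 0.0) for w in vocabulary}


# Toy inputs, not the novels discussed in the slides.
text_1 = "the cat and the dog"
text_2 = "the bird and the bird and the fish"

diffs = frequency_differences(text_1, text_2)
# Words preferred by text 1 come first, words preferred by text 2 last.
ranked = sorted(diffs, key=diffs.get, reverse=True)
```

Sorting by the signed difference puts the words over-represented in the first text at the top of the list and those over-represented in the second text at the bottom.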

The most distinctive words

##   [1] "of"          "was"         "her"         "elizabeth"   "to"         
##   [6] "be"          "mr"          "had"         "not"         "darcy"      
##  [11] "she"         "very"        "which"       "in"          "all"        
##  [16] "that"        "they"        "bennet"      "their"       "but"        
##  [21] "been"        "such"        "jane"        "bingley"     "could"      
##  [26] "much"        "am"          "so"          "with"        "mrs"        
##  [31] "as"          "were"        "them"        "have"        "for"        
##  [36] "is"          "by"          "sister"      "it"          "wickham"    
##  [41] "what"        "will"        "collins"     "herself"     "most"       
##  [46] "miss"        "soon"        "this"        "lydia"       "who"        
##  [51] "any"         "dear"        "family"      "lady"        "every"      
##  [56] "must"        "know"        "without"     "though"      "more"       
##  [61] "letter"      "lizzy"       "other"       "think"       "longbourn"  
##  [66] "many"        "good"        "gardiner"    "mother"      "do"         
##  [71] "no"          "may"         "great"       "hope"        "well"       
##  [76] "than"        "ladies"      "sisters"     "time"        "netherfield"
##  [81] "manner"      "friend"      "can"         "aunt"        "kitty"      
##  [86] "charlotte"   "feelings"    "daughter"    "before"      "however"    
##  [91] "opinion"     "colonel"     "lucas"       "whom"        "marriage"   
##  [96] "always"      "might"       "nothing"     "indeed"      "happiness"

Keywords LL

Keywords analysis

##   [1] "elizabeth"     "darcy"         "bennet"        "jane"         
##   [5] "bingley"       "of"            "wickham"       "collins"      
##   [9] "very"          "mr"            "lydia"         "which"        
##  [13] "such"          "sister"        "was"           "their"        
##  [17] "lizzy"         "much"          "am"            "longbourn"    
##  [21] "be"            "all"           "gardiner"      "had"          
##  [25] "been"          "most"          "ladies"        "netherfield"  
##  [29] "they"          "kitty"         "her"           "charlotte"    
##  [33] "colonel"       "lucas"         "mrs"           "sisters"      
##  [37] "dear"          "family"        "not"           "herself"      
##  [41] "meryton"       "opinion"       "marriage"      "pemberley"    
##  [45] "letter"        "soon"          "daughters"     "aunt"         
##  [49] "bingley^s"     "rosings"       "town"          "could"        
##  [53] "darcy^s"       "lady"          "party"         "hertfordshire"
##  [57] "daughter"      "them"          "de"            "william"      
##  [61] "girls"         "miss"          "elizabeth^s"   "forster"      
##  [65] "lydia^s"       "happiness"     "without"       "many"         
##  [69] "bourgh"        "mother"        "so"            "every"        
##  [73] "but"           "gentlemen"     "general"       "known"        
##  [77] "feelings"      "she"           "therefore"     "sister^s"     
##  [81] "london"        "hurst"         "wickham^s"     "officers"     
##  [85] "fitzwilliam"   "were"          "whom"          "manner"       
##  [89] "ladyship"      "bennet^s"      "civility"      "friend"       
##  [93] "ball"          "mary"          "jane^s"        "hope"         
##  [97] "certainly"     "will"          "any"           "that"

Zeta

Zeta: the background

Zeta: the solution

\[\zeta_{(a,b)} = \left(\frac{f_a - f_b}{100}\right) +1\]

(Burrows, 2007; Craig, 2009)
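The formula can be tried out on toy data. The sketch below assumes, following the usual description of Craig's Zeta, that \(f_a\) and \(f_b\) are the percentages of text segments in sets A and B that contain the word at least once; the slide itself does not define them. The function names and the toy segments are hypothetical.

```python
def segment(tokens, size):
    """Split a token list into consecutive segments of `size` tokens.

    A trailing segment shorter than `size` is dropped.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def segment_percentage(word, segments):
    """Percentage of segments in which the word occurs at least once."""
    hits = sum(1 for seg in segments if word in seg)
    return 100.0 * hits / len(segments)


def zeta(word, segments_a, segments_b):
    """Zeta as on the slide: (f_a - f_b) / 100 + 1, ranging from 0 to 2."""
    f_a = segment_percentage(word, segments_a)
    f_b = segment_percentage(word, segments_b)
    return (f_a - f_b) / 100.0 + 1.0


# Toy segments (assumption: already tokenized and lower-cased).
segments_a = [["a", "b"], ["a", "c"]]
segments_b = [["b", "c"], ["c", "d"]]

scores = {w: zeta(w, segments_a, segments_b) for w in ["a", "b", "c", "d"]}
```

Under this reading, a score of 2 means the word occurs in every segment of A and in none of B, a score of 0 means the reverse, and a score of 1 means no preference.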

Zeta: outcomes

Zeta’s words

##   [1] "elizabeth"     "jane"          "darcy"         "bennet"       
##   [5] "longbourn"     "opinion"       "wickham"       "bingley"      
##   [9] "lizzy"         "town"          "sister"        "lydia"        
##  [13] "ladies"        "netherfield"   "sisters"       "meryton"      
##  [17] "general"       "therefore"     "darcy^s"       "marriage"     
##  [21] "party"         "london"        "certainly"     "bingley^s"    
##  [25] "hertfordshire" "aunt"          "carriage"      "known"        
##  [29] "settled"       "others"        "collins"       "kitty"        
##  [33] "pemberley"     "daughters"     "elizabeth^s"   "jane^s"       
##  [37] "sister^s"      "engaged"       "perfectly"     "feelings"     
##  [41] "lucas"         "bennet^s"      "importance"    "credit"       
##  [45] "manners"       "happiness"     "able"          "behaviour"    
##  [49] "wickham^s"     "girls"         "most"          "civility"     
##  [53] "sensible"      "satisfied"     "mentioned"     "acquaintance" 
##  [57] "daughter"      "affection"     "attention"     "whom"         
##  [61] "seeing"        "charlotte"     "rosings"       "attentions"   
##  [65] "depend"        "excellent"     "believed"      "assured"      
##  [69] "particularly"  "necessary"     "regard"        "likely"       
##  [73] "conversation"  "colonel"       "many"          "added"        
##  [77] "praise"        "understanding" "mary"          "highly"       
##  [81] "anyone"        "address"       "concern"       "fortune"      
##  [85] "assure"        "agreeable"     "honour"        "consequence"  
##  [89] "given"         "possible"      "mother"        "gardiner"     
##  [93] "de"            "lydia^s"       "bourgh"        "gentlemen"    
##  [97] "pleasing"      "eldest"        "particulars"   "subject"

TF–IDF

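\[ TFIDF_{w} = f_{w} \times \log \frac{N}{n_{w}} \]

where \(f_{w}\) is the frequency of the word \(w\) in a given text, \(N\) is the number of texts in the collection, and \(n_{w}\) is the number of texts containing \(w\).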
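A minimal sketch of the formula, assuming raw counts for the term frequency and the natural logarithm (the slide fixes neither choice). The function name and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: dict mapping a name to a list of tokens.

    Returns a dict mapping each name to {word: TF-IDF score}.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        scores[name] = {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Toy collection, not the corpus used in the slides.
documents = {
    "text_a": ["darcy", "the", "darcy"],
    "text_b": ["heathcliff", "the"],
}
scores = tfidf(documents)
```

A word that occurs in every document gets a score of zero, because the logarithm of 1 is 0. This is why function words drop out of a TF–IDF list while proper names rise to the top.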

TF–IDF’s words

##   [1] "darcy"         "bennet"        "bingley"       "elizabeth"    
##   [5] "wickham"       "collins"       "lydia"         "longbourn"    
##   [9] "gardiner"      "netherfield"   "lucas"         "lizzy"        
##  [13] "meryton"       "pemberley"     "bingley^s"     "rosings"      
##  [17] "kitty"         "darcy^s"       "hertfordshire" "elizabeth^s"  
##  [21] "forster"       "lydia^s"       "bourgh"        "wickham^s"    
##  [25] "fitzwilliam"   "jane"          "catherine"     "bennet^s"     
##  [29] "phillips"      "hunsford"      "brighton"      "collins^s"    
##  [33] "derbyshire"    "hurst"         "charlotte"     "colonel"      
##  [37] "officers"      "william"       "kent"          "lucases"      
##  [41] "gardiner^s"    "denny"         "charlotte^s"   "ladyship"     
##  [45] "judgement"     "catherine^s"   "de"            "jane^s"       
##  [49] "bennets"       "lambton"       "reynolds"      "maria"        
##  [53] "gracechurch"   "etc"           "anyone"        "ladyship^s"   
##  [57] "eliza"         "jenkinson"     "parsonage"     "elopement"    
##  [61] "behaviour"     "georgiana"     "gardiners"     "entail"       
##  [65] "caroline"      "regiment"      "bingleys"      "collinses"    
##  [69] "enumerating"   "fitzwilliam^s" "lucas^s"       "phillips^s"   
##  [73] "cousin"        "saturday"      "shire"         "militia"      
##  [77] "corps"         "james^s"       "lakes"         "bourgh^s"     
##  [81] "clapham"       "fishing"       "newcastle"     "younge"       
##  [85] "ladies"        "niece"         "exposing"      "felicity"     
##  [89] "imprudence"    "uncommonly"    "assembly"      "wednesday"    
##  [93] "nieces"        "engagements"   "louisa"        "twelvemonth"  
##  [97] "gratifying"    "william^s"     "annesley"      "caroline^s"

A new method

An ideal measure

  1. will automatically extract meaningful words for texts A and B:
    • by “automatically” we mean that the technique will be unsupervised;
    • by “meaningful” we mean that an extracted keyword carries much semantic information about a given subset of texts;
  2. will be able to distinguish whether the texts being compared differ substantially or only marginally;
  3. will not require any computationally intensive algorithms;
  4. will be conceptually simple;
  5. will have an intuitive linguistic interpretation;

An ideal measure (cont.)

  6. will not make any statistical assumptions, i.e. will be non-parametric;
  7. will allow for significance testing of the identified keywords;
  8. will not require any hyperparameters, or at least any arbitrarily chosen hyperparameters;
  9. will correct for frequent words.

A new measure

Shakespeare vs. Conan Doyle

Shakespeare           Conan Doyle
word        count     word        count
the         27595     the         28662
and         26735     and         14109
I           22538     of          13229
northerly   1         revolvers   1

Frequency and position

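\[P_{r(x)} = \frac{1}{N}\sum_{i=1}^{r(x)}f(x_{i})\]

Here \(r(x)\) is the frequency rank of the word \(x\), \(f(x_{i})\) is the frequency of the word at rank \(i\), and \(N\) is the sum of all frequencies, which also guarantees that the values are scaled into the range [0, 1].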

Frequency and position

A simpler and more elegant version of the above:

\[P_{r(x)} = \frac{1}{N}\left(\sum_{i=1}^{r(x)}f(x_{i}) - f(x)\right)\]

which is of course equal to:

\[P_{r(x)} = \frac{1}{N}\sum_{i=1}^{r(x)-1}f(x_{i})\]

The cracovian

Having normalized the frequencies/positions of the words, one can easily compute the distance between the positions of the same word in two sets (texts or subcorpora), say A and B, simply by taking their difference. This can be defined as

\[\delta_{x,AB} = \| P_{r(x),A} - P_{r(x),B} \|\]

Cumulative sums

The above geometric interpretation can also be expressed algebraically. In this interpretation, the position \(P_{r(x)}\) of a given word is simply the cumulative sum of the frequencies of all the preceding words.
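The cumulative-sum reading can be sketched as follows. This is an illustration of the idea only, not the author's implementation: the alphabetical tie-breaking, the restriction to words shared by both sets, and all function names are assumptions made for the example.

```python
from collections import Counter


def positions(tokens):
    """Map each word to the cumulative relative frequency of the words ranked before it."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Descending frequency; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    pos = {}
    running = 0
    for word, count in ranked:
        pos[word] = running / total  # sum over all preceding words, scaled by N
        running += count
    return pos


def position_differences(tokens_a, tokens_b):
    """Absolute difference of positions for words present in both sets."""
    pos_a = positions(tokens_a)
    pos_b = positions(tokens_b)
    # Assumption: words missing from one set are simply skipped.
    shared = set(pos_a) & set(pos_b)
    return {w: abs(pos_a[w] - pos_b[w]) for w in shared}


# Toy inputs, not the novels discussed in the slides.
tokens_a = "the the the and and darcy".split()
tokens_b = "the the and darcy darcy darcy".split()

pos_a = positions(tokens_a)
delta = position_differences(tokens_a, tokens_b)
most_distinctive = max(delta, key=delta.get)
```

In the toy data "darcy" is last in set A and first in set B, so its position shifts by 5/6, which is the largest difference of the three words.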

The words

##   [1] "elizabeth"     "darcy"         "bennet"        "jane"         
##   [5] "bingley"       "wickham"       "collins"       "lydia"        
##   [9] "lizzy"         "longbourn"     "gardiner"      "ladies"       
##  [13] "netherfield"   "kitty"         "charlotte"     "colonel"      
##  [17] "lucas"         "sisters"       "meryton"       "pemberley"    
##  [21] "daughters"     "bingley^s"     "rosings"       "opinion"      
##  [25] "darcy^s"       "marriage"      "hertfordshire" "de"           
##  [29] "william"       "elizabeth^s"   "forster"       "lydia^s"      
##  [33] "girls"         "bourgh"        "town"          "sister^s"     
##  [37] "gentlemen"     "sister"        "hurst"         "wickham^s"    
##  [41] "fitzwilliam"   "officers"      "ladyship"      "bennet^s"     
##  [45] "party"         "jane^s"        "attentions"    "phillips"     
##  [49] "ball"          "hunsford"      "brighton"      "collins^s"    
##  [53] "derbyshire"    "mary"          "aunt"          "importance"   
##  [57] "depend"        "eliza"         "pleasing"      "eldest"       
##  [61] "general"       "match"         "particulars"   "understanding"
##  [65] "civility"      "compliment"    "praise"        "niece"        
##  [69] "regiment"      "concerned"     "connections"   "felicity"     
##  [73] "maria"         "charlotte^s"   "known"         "parsonage"    
##  [77] "saturday"      "street"        "highly"        "dearest"      
##  [81] "daughter"      "london"        "scheme"        "favour"       
##  [85] "happiness"     "dancing"       "fortune"       "engagement"   
##  [89] "prevailed"     "resentment"    "caroline"      "relations"    
##  [93] "partner"       "kent"          "pounds"        "lucases"      
##  [97] "occurred"      "excellent"     "etc"           "gardiner^s"

Similar yet not identical

##        freqs         zeta            tfidf           cracovian      
##   [1,] "of"          "elizabeth"     "darcy"         "elizabeth"    
##   [2,] "was"         "jane"          "bennet"        "darcy"        
##   [3,] "her"         "darcy"         "bingley"       "bennet"       
##   [4,] "elizabeth"   "bennet"        "elizabeth"     "jane"         
##   [5,] "to"          "longbourn"     "wickham"       "bingley"      
##   [6,] "be"          "opinion"       "collins"       "wickham"      
##   [7,] "mr"          "wickham"       "lydia"         "collins"      
##   [8,] "had"         "bingley"       "longbourn"     "lydia"        
##   [9,] "not"         "lizzy"         "gardiner"      "lizzy"        
##  [10,] "darcy"       "town"          "netherfield"   "longbourn"    
##  [11,] "she"         "sister"        "lucas"         "gardiner"     
##  [12,] "very"        "lydia"         "lizzy"         "ladies"       
##  [13,] "which"       "ladies"        "meryton"       "netherfield"  
##  [14,] "in"          "netherfield"   "pemberley"     "kitty"        
##  [15,] "all"         "sisters"       "bingley^s"     "charlotte"    
##  [16,] "that"        "meryton"       "rosings"       "colonel"      
##  [17,] "they"        "general"       "kitty"         "lucas"        
##  [18,] "bennet"      "therefore"     "darcy^s"       "sisters"      
##  [19,] "their"       "darcy^s"       "hertfordshire" "meryton"      
##  [20,] "but"         "marriage"      "elizabeth^s"   "pemberley"    
##  [21,] "been"        "party"         "forster"       "daughters"    
##  [22,] "such"        "london"        "lydia^s"       "bingley^s"    
##  [23,] "jane"        "certainly"     "bourgh"        "rosings"      
##  [24,] "bingley"     "bingley^s"     "wickham^s"     "opinion"      
##  [25,] "could"       "hertfordshire" "fitzwilliam"   "darcy^s"      
##  [26,] "much"        "aunt"          "jane"          "marriage"     
##  [27,] "am"          "carriage"      "catherine"     "hertfordshire"
##  [28,] "so"          "known"         "bennet^s"      "de"           
##  [29,] "with"        "settled"       "phillips"      "william"      
##  [30,] "mrs"         "others"        "hunsford"      "elizabeth^s"  
##  [31,] "as"          "collins"       "brighton"      "forster"      
##  [32,] "were"        "kitty"         "collins^s"     "lydia^s"      
##  [33,] "them"        "pemberley"     "derbyshire"    "girls"        
##  [34,] "have"        "daughters"     "hurst"         "bourgh"       
##  [35,] "for"         "elizabeth^s"   "charlotte"     "town"         
##  [36,] "is"          "jane^s"        "colonel"       "sister^s"     
##  [37,] "by"          "sister^s"      "officers"      "gentlemen"    
##  [38,] "sister"      "engaged"       "william"       "sister"       
##  [39,] "it"          "perfectly"     "kent"          "hurst"        
##  [40,] "wickham"     "feelings"      "lucases"       "wickham^s"    
##  [41,] "what"        "lucas"         "gardiner^s"    "fitzwilliam"  
##  [42,] "will"        "bennet^s"      "denny"         "officers"     
##  [43,] "collins"     "importance"    "charlotte^s"   "ladyship"     
##  [44,] "herself"     "credit"        "ladyship"      "bennet^s"     
##  [45,] "most"        "manners"       "judgement"     "party"        
##  [46,] "miss"        "happiness"     "catherine^s"   "jane^s"       
##  [47,] "soon"        "able"          "de"            "attentions"   
##  [48,] "this"        "behaviour"     "jane^s"        "phillips"     
##  [49,] "lydia"       "wickham^s"     "bennets"       "ball"         
##  [50,] "who"         "girls"         "lambton"       "hunsford"     
##  [51,] "any"         "most"          "reynolds"      "brighton"     
##  [52,] "dear"        "civility"      "maria"         "collins^s"    
##  [53,] "family"      "sensible"      "gracechurch"   "derbyshire"   
##  [54,] "lady"        "satisfied"     "etc"           "mary"         
##  [55,] "every"       "mentioned"     "anyone"        "aunt"         
##  [56,] "must"        "acquaintance"  "ladyship^s"    "importance"   
##  [57,] "know"        "daughter"      "eliza"         "depend"       
##  [58,] "without"     "affection"     "jenkinson"     "eliza"        
##  [59,] "though"      "attention"     "parsonage"     "pleasing"     
##  [60,] "more"        "whom"          "elopement"     "eldest"       
##  [61,] "letter"      "seeing"        "behaviour"     "general"      
##  [62,] "lizzy"       "charlotte"     "georgiana"     "match"        
##  [63,] "other"       "rosings"       "gardiners"     "particulars"  
##  [64,] "think"       "attentions"    "entail"        "understanding"
##  [65,] "longbourn"   "depend"        "caroline"      "civility"     
##  [66,] "many"        "excellent"     "regiment"      "compliment"   
##  [67,] "good"        "believed"      "bingleys"      "praise"       
##  [68,] "gardiner"    "assured"       "collinses"     "niece"        
##  [69,] "mother"      "particularly"  "enumerating"   "regiment"     
##  [70,] "do"          "necessary"     "fitzwilliam^s" "concerned"    
##  [71,] "no"          "regard"        "lucas^s"       "connections"  
##  [72,] "may"         "likely"        "phillips^s"    "felicity"     
##  [73,] "great"       "conversation"  "cousin"        "maria"        
##  [74,] "hope"        "colonel"       "saturday"      "charlotte^s"  
##  [75,] "well"        "many"          "shire"         "known"        
##  [76,] "than"        "added"         "militia"       "parsonage"    
##  [77,] "ladies"      "praise"        "corps"         "saturday"     
##  [78,] "sisters"     "understanding" "james^s"       "street"       
##  [79,] "time"        "mary"          "lakes"         "highly"       
##  [80,] "netherfield" "highly"        "bourgh^s"      "dearest"      
##  [81,] "manner"      "anyone"        "clapham"       "daughter"     
##  [82,] "friend"      "address"       "fishing"       "london"       
##  [83,] "can"         "concern"       "newcastle"     "scheme"       
##  [84,] "aunt"        "fortune"       "younge"        "favour"       
##  [85,] "kitty"       "assure"        "ladies"        "happiness"    
##  [86,] "charlotte"   "agreeable"     "niece"         "dancing"      
##  [87,] "feelings"    "honour"        "exposing"      "fortune"      
##  [88,] "daughter"    "consequence"   "felicity"      "engagement"   
##  [89,] "before"      "given"         "imprudence"    "prevailed"    
##  [90,] "however"     "possible"      "uncommonly"    "resentment"   
##  [91,] "opinion"     "mother"        "assembly"      "caroline"     
##  [92,] "colonel"     "gardiner"      "wednesday"     "relations"    
##  [93,] "lucas"       "de"            "nieces"        "partner"      
##  [94,] "whom"        "lydia^s"       "engagements"   "kent"         
##  [95,] "marriage"    "bourgh"        "louisa"        "pounds"       
##  [96,] "always"      "gentlemen"     "twelvemonth"   "lucases"      
##  [97,] "might"       "pleasing"      "gratifying"    "occurred"     
##  [98,] "nothing"     "eldest"        "william^s"     "excellent"    
##  [99,] "indeed"      "particulars"   "annesley"      "etc"          
## [100,] "happiness"   "subject"       "caroline^s"    "gardiner^s"

Comparison

Gender keywords (19th century)

LL keywords

##  [1] "maggie"     "adam"       "lydgate"    "dorothea"   "emma"      
##  [6] "lucy"       "hetty"      "elinor"     "tulliver"   "jane"      
## [11] "elizabeth"  "casaubon"   "bulstrode"  "rosamond"   "marianne"  
## [16] "weston"     "fred"       "dinah"      "catherine"  "heathcliff"
## [21] "was"        "darcy"      "tom"        "poyser"     "harriet"   
## [26] "philip"     "felt"       "seemed"     "knightley"  "brooke"    
## [31] "its"        "fairfax"    "linton"     "glegg"      "elton"
##  [1] "upon"      "lady"      "thou"      "lovelace"  "jones"     "phineas"  
##  [7] "my"        "pen"       "lord"      "dear"      "sir"       "major"    
## [13] "letter"    "howe"      "laura"     "harlowe"   "honour"    "madam"    
## [19] "duke"      "george"    "that"      "belford"   "crawley"   "pendennis"
## [25] "sophia"    "clarissa"  "lopez"     "micawber"  "pamela"    "b"        
## [31] "finn"      "so"        "peggotty"  "your"      "captain"

Cracovian

##  [1] "maggie"      "lydgate"     "adam"        "dorothea"    "hetty"      
##  [6] "lucy"        "elinor"      "tulliver"    "casaubon"    "rosamond"   
## [11] "bulstrode"   "marianne"    "weston"      "emma"        "dinah"      
## [16] "elizabeth"   "heathcliff"  "darcy"       "poyser"      "catherine"  
## [21] "fred"        "knightley"   "brooke"      "philip"      "fairfax"    
## [26] "glegg"       "elton"       "linton"      "graham"      "bennet"     
## [31] "seth"        "garth"       "harriet"     "middlemarch" "ladislaw"
##  [1] "lovelace"    "phineas"     "jones"       "major"       "howe"       
##  [6] "harlowe"     "belford"     "crawley"     "pendennis"   "clarissa"   
## [11] "lopez"       "micawber"    "sophia"      "pamela"      "laura"      
## [16] "finn"        "peggotty"    "duke"        "allworthy"   "thou"       
## [21] "amelia"      "clavering"   "osborne"     "copperfield" "pen"        
## [26] "wharton"     "traddles"    "bounderby"   "dobbin"      "rawdon"     
## [31] "solmes"      "b"           "jarndyce"    "dora"        "eleanor"

Zeta

##  [1] "felt"      "feelings"  "feel"      "feeling"   "tone"      "seemed"   
##  [7] "sense"     "rose"      "chapter"   "quiet"     "oh"        "minutes"  
## [13] "cold"      "glance"    "yes"       "deep"      "entered"   "silence"  
## [19] "looked"    "added"     "its"       "smile"     "turned"    "something"
## [25] "seated"    "scarcely"  "speaking"  "suddenly"  "longer"    "strong"   
## [31] "i^ve"      "paused"    "towards"   "continued"
##  [1] "honour"    "lord"      "favour"    "although"  "friend"    "lady"     
##  [7] "dear"      "letter"    "upon"      "dearest"   "honest"    "gentleman"
## [13] "lovelace"  "occasion"  "sir"       "sex"       "letters"   "london"   
## [19] "harlowe"   "worthy"    "whom"      "story"     "therefore" "pen"      
## [25] "thou"      "madam"     "creature"  "tis"       "wretch"    "beloved"  
## [31] "clarissa"  "howe"      "written"   "write"

Collocations

Collocations in corpus linguistics

Word frequencies as probabilities

An example: \[ P(A) = 0.001 \quad\quad P(B) = 0.002 \quad\quad P(A) \times P(B) = 0.000002 \]
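The arithmetic above gives the probability of seeing A and B together if the two words were independent. A short sketch of how such a baseline can be compared with an observed co-occurrence probability, using pointwise mutual information as one common association measure (the slides do not say which measure the talk relies on). The observed value below is hypothetical.

```python
import math


def expected_joint(p_a, p_b):
    """Probability of A and B co-occurring if they were independent."""
    return p_a * p_b


def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information in bits: log2 of observed over expected."""
    return math.log2(p_ab / expected_joint(p_a, p_b))


p_a, p_b = 0.001, 0.002            # values from the slide
expected = expected_joint(p_a, p_b)

observed = 0.00005                 # hypothetical observed probability of the pair
score = pmi(observed, p_a, p_b)    # positive: the pair occurs more often than chance
```

A positive score indicates that the pair occurs more often than independence would predict, and a negative score that it occurs less often.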

Collocations in corpus linguistics

strong tea — *powerful tea

powerful computer — *strong computer

Topic modeling

What’s the aim?

Assumptions

What is a topic?

We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability.

(Blei 2012, 78)
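Blei's definition can be made concrete with a toy example: each topic is a vector of probabilities over the same fixed vocabulary, and the probabilities sum to 1. The vocabulary and all numbers below are invented for illustration and are not the output of any model.

```python
# A fixed vocabulary shared by all topics (toy example).
vocabulary = ["gene", "dna", "species", "evolution", "sword"]

# Each topic assigns a probability to every word in the vocabulary.
# The values are made up for illustration.
topics = {
    "genetics": [0.45, 0.40, 0.08, 0.05, 0.02],
    "evolutionary biology": [0.10, 0.10, 0.40, 0.38, 0.02],
}


def top_words(weights, vocabulary, n=2):
    """Return the n words with the highest probability in a topic."""
    ranked = sorted(zip(vocabulary, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


genetics_top = top_words(topics["genetics"], vocabulary)
evolution_top = top_words(topics["evolutionary biology"], vocabulary)
```

Lists of top words, such as the 50-word topic shown in the slides, are obtained in this way: the words of one topic are sorted by probability and the head of the list is displayed.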

Latent Dirichlet Allocation (LDA)

Assumptions (cont.)

A topic (50 top words)

fight soldier arms war soldiers field fly sword horse valiant march battle brave messenger arm army trumpet valour kings camp alarum walls join wars slain tent forces gates drum courage trumpets lion town fought foes english armour city saint guard colours victory herald swords fame armed country wounds plain safe …

Shakespeare

Topics in the Shakespearean canon

Fights & swords (topic 6)

Family relations (topic 21)

Tears & sorrow (topic 24)

Night & sleep (topic 23)

Face & kisses (topic 8)

Love (topic 5)

The elements (topic 10)

People? (topic 15)

A mixture of everything? (topic 17)

How to interpret topics?

Indeed calling these models “topic models” is retrospective – the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed. The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA.

(Blei 2012, 79)

Topics in documents

The climax of Romeo and Juliet

The beginning of The Tempest

A Midsummer Night’s Dream

Topics vs. genres

Topics vs. genres – cluster analysis

Topics vs. genres – PCA

Topics vs. genres

Titus Andronicus

The Tempest

Hamlet

Thank you!

Appendix

Intersection of the two sets

##   [1] "sisters"       "opinion"       "marriage"      "girls"        
##   [5] "town"          "sister^s"      "sister"        "party"        
##   [9] "ball"          "mary"          "aunt"          "general"      
##  [13] "match"         "understanding" "civility"      "compliment"   
##  [17] "praise"        "known"         "street"        "highly"       
##  [21] "daughter"      "london"        "scheme"        "favour"       
##  [25] "happiness"     "dancing"       "fortune"       "engagement"   
##  [29] "resentment"    "relations"     "partner"       "pounds"       
##  [33] "excellent"     "evident"       "invitation"    "letter"       
##  [37] "wholly"        "anyone"        "settled"       "sensible"     
##  [41] "women"         "extraordinary" "therefore"     "agreeable"    
##  [45] "madam"         "tuesday"       "possibility"   "fortunate"    
##  [49] "credit"        "endeavour"     "dear"          "most"         
##  [53] "success"       "concern"       "others"        "family"       
##  [57] "grateful"      "assure"        "belief"        "engaged"      
##  [61] "carriage"      "gratitude"     "terms"         "honour"       
##  [65] "virtue"        "enjoyment"     "advantage"     "unlucky"      
##  [69] "feelings"      "acknowledged"  "tolerable"     "inquiries"    
##  [73] "such"          "motive"        "objections"    "likewise"     
##  [77] "expectation"   "noble"         "moments"       "attended"     
##  [81] "whom"          "address"       "manners"       "satisfied"    
##  [85] "degree"        "valuable"      "remarkably"    "believed"     
##  [89] "assured"       "very"          "uncle^s"       "respect"      
##  [93] "thousand"      "situation"     "overcome"      "drawing-room" 
##  [97] "visitors"      "design"        "receiving"     "affair"

Jane Austen, Pride and Prejudice

Words for the other set

##   [1] "i^m"       "don^t"     "you^ll"    "won^t"     "kitchen"   "master"   
##   [7] "lad"       "papa"      "bid"       "can^t"     "black"     "fire"     
##  [13] "bed"       "chamber"   "lie"       "exclaimed" "whispered" "there^s"  
##  [19] "o"         "cry"       "hands"     "hair"      "fell"      "hearth"   
##  [25] "god"       "arm"       "soul"      "boy"       "dark"      "old"      
##  [31] "remarked"  "feet"      "answered"  "wind"      "fingers"   "dead"     
##  [37] "earth"     "lay"       "chair"     "mouth"     "rough"     "tongue"   
##  [43] "corner"    "seized"    "white"     "fit"       "death"     "mistress" 
##  [49] "teeth"     "shut"      "crying"    "hold"      "door"      "die"      
##  [55] "sleep"     "held"      "water"     "grew"      "horse"     "hurt"     
##  [61] "afternoon" "darling"   "th"        "breath"    "hand"      "face"     
##  [67] "gaze"      "show"      "presently" "round"     "rose"      "catherine"
##  [73] "watch"     "to-night"  "sprang"    "mine"      "kill"      "presence" 
##  [79] "hate"      "frame"     "child"     "departed"  "minute"    "endure"   
##  [85] "ear"       "bit"       "stepped"   "top"       "bitter"    "passion"  
##  [91] "master^s"  "seat"      "window"    "hat"       "died"      "sit"      
##  [97] "settle"    "ah"        "got"       "lock"

Emily Bronte, Wuthering Heights