Zipf’s Law and Subsets of Lexis

Maciej Eder, Rafał L. Górski, Joanna Byszuk

Institute of Polish Language (Polish Academy of Sciences)

Qualico 2018, Wrocław, 5th July 2018

Zipf’s law and language

Zipf’s Law on a log-log scale

word_unigrams

Research question

Observations on Brown Corpus

[W]ord categories are also fit nicely […] perhaps even more closely than words—but the shape of the fit […] differs.

The general pattern suggests that a full explanation of the word frequency distribution would ideally call on mechanisms general enough to apply to syntactic categories and possibly even other levels of analysis.

(Piantadosi, 2014)

Dataset

Categories

First observations: singular vs. plural

singular_plural

First observations: 1st, 2nd, 3rd person

1st_2nd_person

First observations: cases

cases

First observations: POS-tag ngrams

ngrams

Modeling a power law distribution

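# linear model on the raw rank-frequency data
model <- lm(var_frequency ~ var_rank)
# linear model on log-transformed data (a straight line on a log-log plot)
model <- lm(log(var_frequency) ~ log(var_rank))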
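
As a rough illustration of the log-log regression above, here is a minimal sketch in Python (standard library only) rather than R. The function name `loglog_slope` and the toy frequency list are assumptions made for this example, not part of the original analysis.

```python
import math

def loglog_slope(frequencies):
    """Ordinary least squares of log(frequency) on log(rank).

    Returns (slope, intercept). Hypothetical helper for illustration only.
    """
    # Rank 1 is assigned to the most frequent item.
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (assumed): an exact Zipfian series f(r) = 1000 / r.
toy = [1000.0 / r for r in range(1, 51)]
slope, intercept = loglog_slope(toy)
print(slope, math.exp(intercept))
```

On exactly Zipfian toy data the fitted slope is close to -1. On real corpus counts the points only approximately fall on a line, which is why the slides go on to a maximum-likelihood fit.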

Linear regression on log-log data

If in so doing one discovers a distribution that approximately falls on a straight line, then one can, if one is feeling particularly bold, assert that the distribution follows a power law, with a scaling parameter α given by the absolute slope of the straight line.

(Clauset et al., 2009)

Fitting a power law

\[ \alpha = 1 + n \Big[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{min}} \Big] ^{-1} \]

\[ \alpha \simeq 1 + n \Big[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{min} - \frac{1}{2}} \Big] ^{-1} \]
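
The two estimators above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the function name `alpha_mle` and the toy sample are assumptions, and \(n\) is taken to be the number of observations with \(x_i \geq x_{min}\), as in Clauset et al. (2009).

```python
import math

def alpha_mle(values, x_min, discrete=False):
    """Maximum-likelihood estimate of the scaling parameter alpha.

    discrete=False: alpha = 1 + n / sum(ln(x_i / x_min))
    discrete=True:  alpha ~ 1 + n / sum(ln(x_i / (x_min - 1/2)))
    Only observations with x_i >= x_min enter the sum.
    """
    tail = [x for x in values if x >= x_min]
    shift = 0.5 if discrete else 0.0
    log_sum = sum(math.log(x / (x_min - shift)) for x in tail)
    # log_sum is zero if every retained value equals x_min (continuous case);
    # the estimate is undefined then and this sketch does not handle it.
    return 1.0 + len(tail) / log_sum

# Toy sample (assumed), not corpus data.
sample = [1, 2, 4, 8]
alpha_continuous = alpha_mle(sample, x_min=1)
alpha_discrete = alpha_mle(sample, x_min=1, discrete=True)
print(alpha_continuous, alpha_discrete)
```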

\[ D = \max_{x \geq x_{min}} | S(x) - P(x) | \]
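
A minimal Python sketch of the distance \(D\) follows. It assumes the continuous power-law CDF \(P(x) = 1 - (x / x_{min})^{1 - \alpha}\) for the fitted model (a discrete fit would need a zeta-based CDF instead); the function name `ks_distance` and the toy sample are hypothetical.

```python
def ks_distance(values, x_min, alpha):
    """Kolmogorov-Smirnov distance between the empirical CDF S(x) of the
    observations with x >= x_min and a continuous power-law CDF P(x)."""
    tail = sorted(x for x in values if x >= x_min)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so both sides of the step are compared with the model.
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return d

# Toy sample (assumed), not corpus data.
d = ks_distance([1, 2, 4, 8], x_min=1, alpha=2.0)
print(d)
```

In the procedure described by Clauset et al. (2009), \(x_{min}\) is chosen as the candidate cutoff that minimizes \(D\). Under that reading, a scan over candidate cutoffs would call a function like this once per candidate, re-estimating \(\alpha\) each time.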

Fitted parameters: \(x_{min}\) (cutoff)

Fitted parameters: \(\alpha\) (scaling)

Results

What is perfectly Zipfian?

Prepositions and conjunctions:

POS Occurrences \(\alpha\) ZTypes ZTokens
prep 28,787,398 1.14 97.9% 99.99%
conj 10,455,657 1.21 81.03% 99.99%
comp 4,145,149 1.2 87.23% 99.99%

What is least Zipfian?

Participles:

POS Occurrences \(\alpha\) ZTypes ZTokens
praet 11,995,036 1.97 5.12% 81.23%
pant 35,235 2.05 14.54% 79.09%
ppas 3,187,531 2.24 4.62% 68.1%
pact 1,209,948 2.25 3.5% 65.32%
pcon 662,548 2.3 4.5% 64.38%

Major parts of speech

Cases

Stable, (un)expected Zipf results for both types and tokens:

Case ZTypes ZTokens
acc 98% 18%
voc 95.5% 10%
dat 93% 7%
loc 91% 3.5%
gen 88% 7%
inst 75% 2.5%
nom 65% 2.5%

Relation between \(\alpha\) (scaling) and coverage?

Parameter \(\alpha\) vs. \(\sqrt{\%}\) of Zipfian tokens
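
One hedged way to quantify this relation is a Pearson correlation between \(\alpha\) and the square root of the ZTokens percentage. The sketch below uses only the eight rows listed in the two Results tables (prepositions, conjunctions and participles), so it is an illustration on a subset, not a reproduction of the plot; the helper name `pearson` is an assumption.

```python
import math

# (POS tag, alpha, ZTokens in %) copied from the two Results tables.
rows = [
    ("prep", 1.14, 99.99), ("conj", 1.21, 99.99), ("comp", 1.20, 99.99),
    ("praet", 1.97, 81.23), ("pant", 2.05, 79.09), ("ppas", 2.24, 68.10),
    ("pact", 2.25, 65.32), ("pcon", 2.30, 64.38),
]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

alphas = [alpha for _, alpha, _ in rows]
sqrt_tokens = [math.sqrt(pct) for _, _, pct in rows]
r = pearson(alphas, sqrt_tokens)
print(r)
```

For these eight rows the coefficient is strongly negative: a larger \(\alpha\) goes with a smaller share of Zipfian tokens. Whether this holds across all categories in the dataset cannot be read off this subset.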

Parameter \(\alpha\) vs. \(\%\) of Zipfian types

Conclusions

Thank you!

This research is part of project UMO-2013/11/B/HS2/02795, supported by Poland’s National Science Centre.