Zipf’s Law and Subsets of Lexis

Maciej Eder, Rafał L. Górski, Joanna Byszuk

Institute of Polish Language (Polish Academy of Sciences)

Qualico 2018, Wrocław, 5th July 2018

Zipf’s law and language


Zipf’s Law on a log-log scale


Research question

Observations on Brown Corpus

[W]ord categories are also fit nicely […] perhaps even more closely than words—but the shape of the fit […] differs.

The general pattern suggests that a full explanation of the word frequency distribution would ideally call on mechanisms general enough to apply to syntactic categories and possibly even other levels of analysis.

(Piantadosi, 2014)



First observations: singular vs. plural


First observations: 1st, 2nd, 3rd person


First observations: cases


First observations: POS-tag ngrams


Modeling a power law distribution

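# linear fit on the raw scale (a straight line cannot describe a power law)
model = lm( var_frequency ~ var_rank )
# linear fit on the log-log scale
model = lm( log(var_frequency) ~ log(var_rank) )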

Linear regression on log-log data

If in so doing one discovers a distribution that approximately falls on a straight line, then one can, if one is feeling particularly bold, assert that the distribution follows a power law, with a scaling parameter α given by the absolute slope of the straight line.

(Clauset et al., 2009)
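The log-log regression described in the quotation can be sketched as follows. This is only an illustration: the rank-frequency values are synthetic (an exact curve with an assumed exponent of 1.1), and the variable names simply reuse those from the `lm()` calls above. Clauset et al. (2009) caution that least-squares estimates of this kind can be substantially inaccurate, so the result should be read as a rough first look rather than a fit.

```r
# Synthetic rank-frequency data on an exact power-law curve
# (assumption for illustration; not data from the corpus)
var_rank <- 1:1000
var_frequency <- 1e6 * var_rank^(-1.1)

# Ordinary least squares on the log-log scale
model <- lm(log(var_frequency) ~ log(var_rank))

# The slope is negative for a decaying curve;
# its absolute value serves as the exponent estimate
slope_estimate <- abs(unname(coef(model)[2]))
print(round(slope_estimate, 3))
```

On this noise-free curve the estimate recovers the generating exponent; real frequency lists bend away from the line at both ends, which is where the regression approach becomes unreliable.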

Fitting a power law

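Maximum likelihood estimate of the scaling parameter (continuous data):

\[ \alpha = 1 + n \Big[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{min}} \Big] ^{-1} \]

Approximation for discrete data:

\[ \alpha \simeq 1 + n \Big[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{min} - \frac{1}{2}} \Big] ^{-1} \]

Kolmogorov-Smirnov distance between the empirical CDF \(S(x)\) and the fitted CDF \(P(x)\):

\[ D = \max_{x \geq x_{min}} | S(x) - P(x) | \]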
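A minimal base-R sketch of the three formulas above, under stated assumptions: the function names (`alpha_continuous`, `alpha_discrete`, `ks_distance`) are hypothetical, the data are synthetic draws with an assumed true exponent of 2.5 and cutoff of 10, and the fitted CDF uses the continuous form \(P(x) = 1 - (x / x_{min})^{1 - \alpha}\). It is not the code used for the results reported in this talk.

```r
# Continuous MLE: alpha = 1 + n / sum(log(x_i / xmin)), over x_i >= xmin
alpha_continuous <- function(x, xmin) {
  x_tail <- x[x >= xmin]
  1 + length(x_tail) / sum(log(x_tail / xmin))
}

# Discrete approximation: xmin is replaced by xmin - 1/2
alpha_discrete <- function(x, xmin) {
  x_tail <- x[x >= xmin]
  1 + length(x_tail) / sum(log(x_tail / (xmin - 0.5)))
}

# KS distance between the empirical CDF S(x) and the fitted CDF P(x)
ks_distance <- function(x, xmin, alpha) {
  x_tail <- sort(x[x >= xmin])
  n <- length(x_tail)
  P <- 1 - (x_tail / xmin)^(1 - alpha)
  # compare P with the empirical step function just after and just before each point
  max(abs((1:n) / n - P), abs(P - (0:(n - 1)) / n))
}

# Synthetic data (assumed parameters, for illustration only)
set.seed(1)
alpha_true <- 2.5
xmin <- 10
r <- (1 - runif(10000))^(-1 / (alpha_true - 1))
x_cont <- xmin * r                        # continuous power-law sample
x_disc <- floor((xmin - 0.5) * r + 0.5)   # rounded, integer-valued counterpart

alpha_c <- alpha_continuous(x_cont, xmin)
alpha_d <- alpha_discrete(x_disc, xmin)
D <- ks_distance(x_cont, xmin, alpha_c)
print(c(alpha_c = alpha_c, alpha_d = alpha_d, D = D))
```

Both estimates should land close to the assumed exponent, and a small \(D\) indicates that the empirical and fitted distributions agree above the cutoff.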

Fitted parameters: \(x_{min}\) (cutoff)

Fitted parameters: \(\alpha\) (scaling)
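Clauset et al. (2009) choose the cutoff as the candidate value that minimizes \(D\), re-estimating \(\alpha\) for every candidate. The sketch below assumes that procedure; whether it matches the exact pipeline behind the figures here is an assumption. The function name `fit_power_law`, the `min_tail` threshold, and the synthetic data (a uniform body below an assumed cutoff of 20, a power-law tail with exponent 2.5 above it) are all hypothetical.

```r
# Scan candidate cutoffs; keep the one with the smallest KS distance
fit_power_law <- function(x, min_tail = 50) {
  xs <- sort(x)
  n_all <- length(xs)
  best <- list(xmin = NA, alpha = NA, D = Inf)
  # stop early enough that every tail has at least min_tail observations
  for (i in seq_len(n_all - min_tail + 1)) {
    xmin <- xs[i]
    x_tail <- xs[i:n_all]
    n <- length(x_tail)
    alpha <- 1 + n / sum(log(x_tail / xmin))   # continuous MLE
    P <- 1 - (x_tail / xmin)^(1 - alpha)       # fitted CDF
    D <- max(abs((1:n) / n - P), abs(P - (0:(n - 1)) / n))
    if (D < best$D) best <- list(xmin = xmin, alpha = alpha, D = D)
  }
  best
}

# Synthetic data (assumed parameters): non-power-law body plus power-law tail
set.seed(42)
alpha_true <- 2.5
cutoff_true <- 20
x_body <- runif(1500, min = 1, max = cutoff_true)
x_tail_part <- cutoff_true * (1 - runif(3500))^(-1 / (alpha_true - 1))
x <- c(x_body, x_tail_part)

fit <- fit_power_law(x)
print(fit)
```

The selected cutoff should fall near the point where the power-law regime begins, with the scaling estimate taken only from observations above it. The continuous estimator is used throughout for brevity; word counts are integers and contain ties, so the discrete approximation would be the more natural choice on real frequency lists.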


What is perfectly Zipfian?

Prepositions and conjunctions:

| POS | Occurrences | \(\alpha\) | ZTypes | ZTokens |
|------|-------------|------|--------|---------|
| prep | 28,787,398 | 1.14 | 97.9% | 99.99% |
| conj | 10,455,657 | 1.21 | 81.03% | 99.99% |
| comp | 4,145,149 | 1.2 | 87.23% | 99.99% |

What is least Zipfian?


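| POS | Occurrences | \(\alpha\) | ZTypes | ZTokens |
|-------|-------------|------|--------|---------|
| praet | 11,995,036 | 1.97 | 5.12% | 81.23% |
| pant | 35,235 | 2.05 | 14.54% | 79.09% |
| ppas | 3,187,531 | 2.24 | 4.62% | 68.1% |
| pact | 1,209,948 | 2.25 | 3.5% | 65.32% |
| pcon | 662,548 | 2.3 | 4.5% | 64.38% |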

Major parts of speech


Stable, (un)expected Zipf results for both types and tokens:

| Case | ZTypes | ZTokens |
|------|--------|---------|
| acc | 98% | 18% |
| voc | 95.5% | 10% |
| dat | 93% | 7% |
| loc | 91% | 3.5% |
| gen | 88% | 7% |
| inst | 75% | 2.5% |
| nom | 65% | 2.5% |

Relation between \(\alpha\) (scaling) and coverage?

Parameter \(\alpha\) vs. \(\sqrt{\%}\) of Zipfian tokens

Parameter \(\alpha\) vs. \(\%\) of Zipfian types


Thank you!

This research is part of project UMO-2013/11/B/HS2/02795, supported by Poland’s National Science Centre.