In the previous post in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities.
Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.
One of the more well-known arguments for the authenticity of the manuscript, in addition to its manufacture with period parchment and inks, is that the text appears to follow certain statistical properties associated with human language, and which were unknown at the time of its creation.
The most well-known of these properties is that the frequency of words in the Voynich Manuscript have been claimed to follow a phenomenon known as Zipf’s Law, whereby the frequency of a word’s occurrence in the text is inversely proportional to its rank in the list of words ordered by frequency.
In this post, we will scrutinise the extent to which the expected statistical properties of natural languages hold for the arcane glyphs presented by the Voynich manuscript.
Zipf’s Law is an example of a discrete power law probability distribution. Power laws have been found to lurk beneath a sinister variety of ostensibly natural phenomena, from the relative size of human settlements to the diversity of species descended from a particular ancestral freshwater fish.
In its original context of human langauge, Zipf’s Law states that the most common word in a given language is likely to be roughly twice as common as the second most common word, and three times as common as the third most common word. More precisely, this law holds for much of the corpus, as the law tends to break down somewhat at both the most-frequent and least-frequent words in the corpus1. Despite this, we will focus on the original, simpler Zipfian characterisation in this analysis.
The most well-known, if highly flawed, method to determine whether a distribution follows a power law is to plot it with both axes expressed as a log-scale: a so-called log-log plot. A power law, represented in such a way, will appear linear. Unfortunately, a hideous menagerie of other distributions will also appear linear in such a setting.
More generally, it is rarely sensible to claim that any natural phenomenon follows a given distribution or model, but instead to demonstrate that a distribution presents a useful model for a given set of observations. Indeed, it is possible to fit any set of observations to a power law, with the assumption that the fit will be poor. Ultimately, we can do little more than demonstrate that a given model is the best simulacrum of observed reality, subject to the uses to which it will be put. Certainly, a more Bayesian approach would advocate building a range of models, demonstrating that the power law is most accurate. All truth, it seems, is relative.
Faced with the awful statistical horror of the universe, we are reduced to seeking evidence against a phenomenon’s adherence to a given distribution. Our first examination, then, is to see whether the basic log-log plot supports or undermines the Voynich Manuscript.
A crude visual analysis certainly supports the argument that, for much of the upper half of the Voynich corpus, there is a linear relationship on the log-log plot consistent with Zipf’s Law. As mentioned, however, this superficial appeal to our senses leaves a gnawing lack of certainty in the conclusion. We must turn to less fallible tools.
The poweRlaw package for R is designed specifically to exorcise these particular demons. This package attempts to fit a power law distribution to a series of observations, in our case the word frequencies observed in the corpus of Voynich text. With the fitted model, we then attempt to disprove the null hypothesis that the data is drawn from a power law. If this attempt to betray our own model fails, then we attain an inverse enlightenment: there is insufficient evidence that the model is not drawn from a power law.
This is an inversion of the more typical frequentist null hypothesis scenario. Typically, in such approaches, we hope for a low p-value, typically below 0.05 or even 0.001, showing that the chance of the observations being consistent with the null hypothesis is extremely low. For this test, we instead hope that our p-value is insufficiently low to make such a claim, and thus that a power law is consistent with the data.
The diagram above shows a fitted parameterisation of the power law according to the poweRlaw package. In addition to the visually appealing fit of the line, the weirdly inverted logic of the above test provides a p-value of
0.151. We thus have as much confidence as we can have, via this approach, that a power law is a reasonable model for the text in the Voynich corpus.
Led further down twisting paths by this initial taste of success, we can now present the Voynich corpus against other human-language corpora to gain a faint impression of how similar or different it is to known languages. The following plot compares the frequency of words in the Voynich Manuscript to those of the twenty most popular languages in Wikipedia, taken from the dataset available here.
The Voynich text seems consistent with the behaviour of known natural languages from Wikipedia. The most striking difference being the clustering of Voynich word frequencies in the lower half of the diagram, resulting from the smaller corpus of words in the Voynich Manuscript. This causes, in particular, lower-frequency words to occur an identical number of times, resulting in vertical leaps in the frequency graph towards the lower end.
To highlight this phenomenon, we can apply a similar technique to another widely-translated short text: the United Nations Declaration of Human Rights.
A Refined Randomness
The above arguments might at first appear compelling. The surface incomprehensibility of the Voynich Manuscript succumbs to the deep currents of statistical laws, and reveals an underlying pattern amongst the chaos of the text.
Sadly, however, as with all too many arguments in the literature regarding power law distributions arising in nature, there is a complication to this argument that again highlights the difference between proof and the failure to disprove. Certainly, if a power law had proved incompatible with the Voynich Manuscript then we would have doubted its authenticity. With its apparent adherence to such a distribution, however, we have taken only one hesitant step towards confidence.
Rugg has argued that certain random mechanisms can produce text that adheres to Zipf’s Law, and has demonstrated a simple mechanical procedure for doing so. A more compelling argument is presented, without reference to the Voynich Manuscript, by Li. (1992)2, who demonstrates that a text drawn entirely at random from any given alphabet of symbols that includes a space will result in a text adhering to some form of Zipf-like distribution. We cannot hang our confidence on such a slender thread.
While Zipf’s Law has been shown to hold for human language text, and a text that does not demonstrate it is certainly suspect, it is far from being the only telltale statistical property of natural language. We have already briefly examined sequences of repeated words in the text; we will now delve further.
Another curious distortion of human languages is that they demonstrate a preference for shorter words. The precise mechanism that results in this apparently universal property is unclear, but likely relates somehow to efficiency of communication. Regardless of the deeper causality, in most natural language texts there is a markedly higher frequency of short words than longer words.
As demonstrated by Sigurd, Eeg-Olofsson, and van Weijer, (2004)3, however, the very shortest words are not the most common. Instead, at least for English, Swedish, and German, words between 3 and 5 letters in length dominate. This property can be accurately modelled by an appropriately parameterised Gamma distribution.
Notably, and conveniently, this property will not hold for random texts as described above. These purely stochastic texts would be expected to produce, in the long term, a monotonically-decreasing function as word length increases.
To demonstrate this effect, we can simulate a purely random text along the lines discussed by Li, and show its correspondingly naïve descending plot.
As can be seen, the word-length frequency of this random text forms a comfortingly simple exponential curve, with pseudo-words of one letter being by far the most common. It is also notable that, at the far reaches of the probability distribution, this cacophonous experiment will produce words of almost four-hundred letters in length. While adherence to Zipf’s Law would have misled us into supporting this as an apparently natural language, even a cursory glance at this plot would have convinced us otherwise.
How, then, does the Voynich Manuscript adhere to the expected Gamma distribution of Sigurd et al.? We can employ the excellent fitdistrplus package to peel back this particular veil.
The word-length frequency distribution of the Voynich Manuscript clearly demonstrates a preference for four-letter words, not only breaking free of the confines of pure randomness, but also corresponding broadly with observed frequency patterns of the languages tested by Sigurd et al.
These analyses can only present a dim outline of the text itself, and we resist the awful temptation to attempt any form of decipherment. Certainly, the evidence here seems convincing enough that the Voynich Manuscript does represent a human language, but the statistics presented here are of little use in such an effort. It is likely, of course, that the most frequent words in the manuscript may, under certain assumptions, correspond to the most common words or particles in many languages — the definite article, the indefinite article, conjunctions, pronouns, and similar. Without deeper knowledge of the language, however, and with the range of scribing conventions and shortcuts commonplace in texts of the period, these techniques are too limited to do more than tantalise us with what we may never know.
Subjecting the text of the Voynich Manuscript to the crude frequency analyses presented here can support, although not prove, the view that the manuscript, regardless of its true content, is not simply random gibberish. Nor is the text likely to be the result of a simple mechanical process designed without knowledge of the statistical patterns of human languages. Neither is it likely to be any form of cryptogram more sophisticated than the simplest ciphers, as these would have tended to compromise the statistical properties that we have observed.
The demonstrable following of Zipf’s Law, and the adherence to a Gamma distribution of similar shape to known languages, strongly suggests that the text is likely a representation of some natural language.
In the next post we will attempt blindly to wrench more secrets from the text itself through application of modern textual analysis techniques. Until then the Voynich Manuscript remains, silently obscure, beyond the reach of our faltering science.
- The distributions observed in natural languages are therefore best described as following several parameterisations of the slightly more complex, generalised Zipf–Mandelbrot distribution, with different parameters for the most common, middle, and least-common segments of the corpus.
- Li, W. ‘Random Texts Exhibit Zipf’s-Law-like Word Frequency Distribution’. IEEE Transactions on Information Theory 38, no. 6 (November 1992): 1842–45. https://doi.org/10.1109/18.165464.
- Sigurd, Bengt, Mats Eeg-Olofsson, and Joost van Weijer. ‘Word Length, Sentence Length and Frequency – Zipf Revisited’. Studia Linguistica 58, no. 1 (April 2004): 37–52. https://doi.org/10.1111/j.0039-3193.2004.00109.x.