The Voynich Manuscript

While the world abounds with strange phenomena ripe for analysis in their raw state, there is a peculiar pleasure in scrutinising arcane information curated and obscured by the human mind.

The Voynich Manuscript is one of the most well-known and studied volumes of occult knowledge. The book’s most recent history involves its purchase in 1912 by Wilfrid Voynich, a rare book dealer, from a sale of manuscripts by the Society of Jesus at the Villa Mondragone, Frascati. Following several fruitless years of attempts to decipher the manusript and discover its origin, or to interest others in it, Wilfrid Voynich died. The book passed through a number of other hands before being donated to Yale University by the noted rare book dealer Hans P. Kraus in 1969. It now resides in Yale’s Beinecke Rare Book and Manuscript Library with the designation MS 408.

Written almost entirely in an unknown script, barring a small number of words apparently in Latin and High German, the manuscript is compellingly illustrated with depictions of plants, herbs, human figures, astronomical and astrological symbols. The manuscript has resisted all attempts at interpretation by cryptographers, historians, and linguists.

Voynich Manuscript Folio 178

Voynich Manuscript – Folio 178

From a linguistic and cryptographic perspective, this lack of success in interpretation is not surprising. The two-hundred or so folios of the manuscript, while beautifully illuminated, present a sadly limited corpus of text for the purposes of traditional analysis.

In this short series of posts we will subject the Voynich Manuscript to a range of text analysis techniques, delving into its structure, gain horrific insight into its composition, and skeptically assessing its credibility. The manuscript has been subjected to almost fifty years of furtive attempts by cryptographers, including the US National Security Agency and a menagerie of others from the distinguished to the deranged. We will crudely mimic some earlier results, and hopefully add our own confusion to the roiling mass of current research into the Voynich Manuscript.


Since its discovery, and throughout the ongoing unsuccessful attempts to decipher its contents, many have questioned the authenticity of the Voynich Manuscript. The theory that the entire book is a hoax, either by contemporary scribes or by more modern players, has been raised repeatedly over the years.

Radiocarbon dating in 2010 asserted that the manuscript’s parchment likely dates from the early 15th century; the volume of parchment in the manuscript, and its consistency across the document, make it unlikely, although not impossible, that the book is a modern-day hoax.

Other supporting evidence has drawn from early mentions of the manuscript in correspondence. According to, which presents a far more detailed and thorough description of the research around the manuscript and its history than we could hope to offer here, the first extant mention of the manuscript can be found in a 1639 letter from Athanasius Kircher in Rome, replying to a letter forwarded from Georgius Barschius of Prague by the mathematician Theodor Moretus.

The letter refers to a “book of mysterious steganography” (“libellum… …steganographici mysterisi”) illustrated with pictures of plants, stars and chemical secrets that Kirscher had not yet had time to decipher. Barschius had sought out Kirscher’s expertise due to his fame at the time for claiming to have, erroneously as it later transpired, deciphered the hieroglyphic writing system of the Ancient Egyptian language. Later correspondence between Barschius and Kirscher appears, according to Zandbergen1, to suggest strongly that the mysterious book in question is the Voynich Manuscript based on its description.

A Statistical Argument: Zipf’s Law

We now turn from historical sources to darker, more statistical realms. There is compelling support for the notion that, regardless of the true meaning of the book, its contents are drawn from a human language and are neither random symbols nor any form of sophisticated cipher.

One of the pillars of this argument is that certain statistical properties of the Voynich Manuscripts text strongly resemble those of natural, human languages, and which are unlikely, although not impossible, to arise from random text, artificially generated text, or most forms of encipherment.

The most well-known of these statistical properties is the apparent adherence of the manuscript to Zipf’s Law. This law, made famous by the US linguist George Zipf, observes that in corpora of natural languages, the frequency of a word is inversely proportional to its rank when words from a corpus are ordered by frequency. More plainly: the most common word in a language is likely to be n times more common than the second most common word; the second word will be roughly n times more common than the third word, and so on. Whilst merely an approximation, this law can be seen to hold for most human languages, and for a range of other natural phenomena.

Random gibberish, on the other hand, would most likely not follow Zipf’s Law, although carefully crafted gibberish certainly could. Rugg has demonstrated that a simple mechanical procedure can produce randomised text that adhered to Zipf’s Law, although the example he provides is both somewhat contrived and also presupposes a knowledge of this statistical quirk of human languages in the first place. Given that the physical makeup of the Voynich Manuscript dates to the early 15th Century, some four centuries before Zipf popularised this mathematical assessment of human languages, the argument that it is a contemporary act of calligraphic glossolalia seems strained.

Similarly, most forms of cryptography beyond the simplest substitution ciphers would also skew the text away from Zipf’s Law. It is notable that the Voynich Manuscript predates even works such as Trithemius‘s Steganographia, or the Book of Soyga and its magic tables of letters that so obsessed John Dee.

In contrast, however, it has been claimed that other features of the text raise doubts. One of the most commonly stated counter-arguments to the natural hypothesis of the Voynich text is that some words are repeated an unnatural number of times. Depending on the transcription, individual words have been reported to be repeated up to five times. Whilst this is not an impossible occurrence in human lanugage, it is highly irregular.

The next post in this short series will focus on the Voynich Manuscript’s adherence, or lack thereof, to Zipf’s Law in full. Following that, we will see the extent to which other forms of modern textual analysis can be applied to dissect the arcane and unrelenting secrets of MS 408.

This post, however, will describe the contortions required to render the Voynich text suitable for our particular form of scrutiny.


Given the format and presentation of the text, we make several assumptions about the writing system contained in the Voynich Manuscript:

  • It is written in an alphabet, or potentially an abjad or even an abugida2, and not a logographic system. That the text is not logographic is justified by the small number of individual symbols. The distinction between the other systems is sufficiently subtle that it will not affect our analyses3.
  • The manuscript is written from left to right, and not the reverse, vertically, boustrophedon. This is uncontroversial and apparent from even a cursory inspection of the text itself; the horizontal flow of the writing is clear, with lines clearly starting at the left margin and ending before the right. The text is separated into paragraphs, of which the final line is justified to the left.


Due to the diligent activity of several generations of Voynich researchers, the text of the manuscript has been transcribed into a machine-readable format. As the alphabet is unknown, there are minor uncertainties in rendering the text, leading to a number of similar but competing transcriptions. The subtle details of the various transcription efforts, and their history, are available at:, with the raw data available at We have settled on the v101 transliteration by Glen Claston, rendered in the Intermediate Voynich Transliteration File Format (IVTFF) of Zandbergen. This is one of the more recent and widely-used transcriptions, and has the added advantage of being supported by the availability of a TrueType font. The underlying file is available here:

Crude Manipulations

We perform the following steps to make the data usable for our analyses. For many scenarios, we would develop a generalisable set of steps to allow conversion of many documents to an appropriate form. Until and unless, however, a new cache of documents in the same language are found, it is simpler and easier to perform these one-time steps manually.

Firstly, we delete from the text all incomplete words, as marked in the IVTFF format. This includes:

  • all text in angle brackets
  • all words containing ?’s
  • all words containing []

Secondly, we tokenize the text and remove punctuation. The transcription of the Voynich manuscript that we have chosen uses the following punctuation:

  • “.” is a space
  • “,” is a potential space. For simplicity, we do not treat these as a space.

Finally, we organize the document in an appropriate form to be imported into an R data frame, or tidyverse tibble.

The above steps were performed in the Vim text editor, and the commands used are reproduced in the code below:

Show Vim text manipulation commands.

# Delete all commented lines

# Remove blank lines

# Remove "," -- assume that potential spaces are /not/ spaces. 

# Replace each folio's page marker (initial for each page) 
# with its contents, followed by a comma. (<f1r> -\> f1r,)

# Remove all \<\> entries (non-greedy)

# Join all paragraphs (all newlines followed by a character 
# other than a newline are removed).

# Replace "high ascii" rare characters from the IVTFF with their 
# ASCII representation. (<>)

# Replace full stops with spaces
:\%s/./ /g

The resulting raw data file is available here. This file can be read into R simply by use of the read.csv function:

voynich_tbl <- 
	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
	rename( folio = X1, text = X2 )

As a first, horrifying glance into the forms of analysis that this allows, we can now use our raw data to identify the most repeated words in the manuscript, according to our transcription. The following R code extracts the entirety of the text and encodes it as a run length encoding. This conveniently results in a sequential list of words and the number of times that each is repeated in sequence. We can then simply extract the largest number of repetitions for each word in the corpus:

Count longest word repetition sequences in the Voynich Manuscript.

library( tidyverse )
library( magrittr )

# Count the number of repeated words in the Voynich Manuscript text.

# Load the raw data
voynich_tbl <- 
	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
	rename( folio = X1, text = X2 )

# Extract the text as a vector of words
voynich_vector <- 
	voynich_tbl %>%
	extract2( "text" ) %>%
	paste( sep=" ", collapse=" " ) %>%
	str_split( " " ) %>%

# Create a run length encoding object from the vector
voynich_rle <- 
	voynich_vector %>%

# Convert rle object to a data frame and report the maximum number of repeated
# cases for each word
voynich_repetitions <- 
	voynich_rle %>%
	unclass %>% %>%
	group_by( values ) %>%
	summarise( max_repetitions = max( lengths ) ) %>%
	ungroup %>%
	arrange( desc( max_repetitions ) )

This simple analysis shows that, in the transcription we have chosen, the longest sequences of repeated words are only three words in length, occuring a total of five times in the text. While there are many other arguments against the potential validity of the Voynich Manuscript, word repetition does in itself present a compelling reason to doubt that the text is a human language.

We have now reduced the strange and beautiful elegance of the Voynich Manuscript’s centuries-old illuminations to a crude, utilitarian abstraction. With this particular act of artistic and literary desecration complete, in the next post we will examine Zipf’s Law in more detail, and interrogate the extent to which this law supports or undermines the text’s authenticity.


  2. Indeed, the Ge’ez language, from which the term abugida was derived, has at various times been proposed as a candidate for the source language of the Voynich Manuscript
  3. In fact, most of the analyses we perform would also function in a logographic system.