<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>linguistics &#8211; Weird Data Science</title>
	<atom:link href="https://www.weirddatascience.net/category/linguistics/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.weirddatascience.net</link>
	<description>Paranormal Distributions. Cyclopean Data. Esoteric Regression.</description>
	<lastBuildDate>Sun, 28 Jan 2024 16:27:02 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
<site xmlns="com-wordpress:feed-additions:1">143387998</site>	<item>
		<title>Readings from the Book</title>
		<link>https://www.weirddatascience.net/2024/01/28/readings-from-the-book/</link>
					<comments>https://www.weirddatascience.net/2024/01/28/readings-from-the-book/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Sun, 28 Jan 2024 16:27:02 +0000</pubDate>
				<category><![CDATA[beyond the veil]]></category>
		<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[event]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[stan]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=5768</guid>

					<description><![CDATA[<div class="mh-excerpt">Once again, the Oxford Internet Institute at the University of Oxford -- through madness, or through omission brought on by horrified incredulity -- saw fit to expose its students to the nightmarish patterns that descend, fractal-like, endlessly below the surface of mundane reality. This second OII Halloween Lecture drew on the twisted meanderings we travellers have taken through the cryptic verbiage of the Voynich Manuscript.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2024/01/28/readings-from-the-book/" title="Readings from the Book">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>Once again, the Oxford Internet Institute at the University of Oxford &#8212; through madness, or through omission brought on by horrified incredulity &#8212; saw fit to expose its students to the nightmarish patterns that descend, fractal-like, endlessly below the surface of mundane reality.</p>
<p>This second OII Halloween Lecture drew on the twisted meanderings we travellers have taken through the cryptic verbiage of the Voynich Manuscript. We aim to establish the dread authenticity of the text, by rousing its very statistical bones from the inscrutable fasciae of its pages. Walking a tightrope between careful statistical exploration and ever-burgeoning insanity, we further explore the structures that arise from the text, separating the untranslated knowledge in the book into coherent bodies for future study.</p>
<p>In yet another, almost criminally negligent, oversight, the OII&#8217;s 2024 Halloween Lecture was captured, frozen in space and time, for the detriment and despair of the unexpectant world.</p>

<div class="youtube-embed ye-container" itemprop="video" itemscope itemtype="https://schema.org/VideoObject">
	<meta itemprop="url" content="https://www.youtube.com/v/nl7QRWIRcSk" />
	<meta itemprop="name" content="Readings from the Book" />
	<meta itemprop="description" content="Readings from the Book" />
	<meta itemprop="uploadDate" content="2024-01-28T16:27:02+00:00" />
	<meta itemprop="thumbnailUrl" content="https://i.ytimg.com/vi/nl7QRWIRcSk/default.jpg" />
	<meta itemprop="embedUrl" content="https://www.youtube.com/embed/nl7QRWIRcSk" />
	<meta itemprop="height" content="340" />
	<meta itemprop="width" content="560" />
	<iframe style="border: 0;" class="youtube-player" width="560" height="340" src="https://www.youtube.com/embed/nl7QRWIRcSk" allowfullscreen></iframe>
</div>

<p>&nbsp;</p>
<p>For those not driven to blissful negation by the tortured ramblings of the above, the underlying materials for the talk are presented, with neither hope nor tremor, here.</p>
<a href="https://www.weirddatascience.net/wp-content/uploads/2024/01/illuminating_the_illuminated.pdf" class="pdfemb-viewer" style="" data-width="max" data-height="max" data-mobile-width="500"  data-scrollbar="none" data-download="on" data-tracking="on" data-newwindow="on" data-pagetextbox="off" data-scrolltotop="off" data-startzoom="100" data-startfpzoom="100" data-toolbar="bottom" data-toolbar-fixed="off">illuminating_the_illuminated<br/></a>
<p>&nbsp;</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2024/01/28/readings-from-the-book/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5768</post-id>	</item>
		<item>
		<title>Illuminating the Illuminated – Part Three: Topics of Invention &#124; Topic Modelling the Voynich Manuscript</title>
		<link>https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/</link>
					<comments>https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Tue, 24 Dec 2019 16:51:08 +0000</pubDate>
				<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[cryptology]]></category>
		<category><![CDATA[linguistics]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=1200</guid>

					<description><![CDATA[<div class="mh-excerpt">Our <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">earlier experiments</a> derived some of the darker statistics of the Voynich Manuscript supporting the conjecture, but not erasing all doubt, that the manuscript's cryptic graphemes are drawn from some natural, or shudderingly unnatural, language. Despite our beliefs regarding its authenticity, however, the statistical tools we have employed so far can tell us little about the structure, and almost nothing of the meaning, of the Voynich Manuscript. </div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/" title="Illuminating the Illuminated – Part Three: Topics of Invention &#124; Topic Modelling the Voynich Manuscript">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>Our <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">earlier experiments</a> derived some of the darker statistics of the Voynich Manuscript supporting the conjecture, but not erasing all doubt, that the manuscript&#8217;s cryptic graphemes are drawn from some natural, or shudderingly unnatural, language.</p>
<p>Despite our beliefs regarding its authenticity, however, the statistical tools we have employed so far can tell us little about the structure, and almost nothing of the meaning, of the Voynich Manuscript. In this post, whilst shying away from the madness and confusion of attempting to translate <a href="https://brbl-dl.library.yale.edu/vufind/Record/3519597">MS&nbsp;408</a>, or of definitively identifying its language, we will delve into the extent to which modern natural language processing techniques can reveal its lesser secrets.</p>
<p>The mechanisms we will apply in this post are drawn from the world of <em>topic modelling</em>, an approach widely used in the processing of human language to identify eerily related documents within a corpus of text.</p>
<p>Topic modelling, in its most widely used form, lies in considering each given document as a nebulous admixture of unseen and unknowable <em>topics</em>. These topics, in effect, are themselves probability distributions of words that are likely to occur together. Each document, therefore, is characterised as a set of probability distributions that generate the observed words. This approach, known as <a href="http://jmlr.csail.mit.edu/papers/v3/blei03a.html">Latent Dirichlet Allocation</a>, dispassionately extracts the hidden structure of documents by deriving these underlying distributions.</p>
<p>For known languages, latent Dirichlet allocation extrudes a set of topics characterised by the high-probability words that they generate. These, in turn, can be subjected to human interpretation to identify the semantic underpinnings behind the topics.</p>
<p>To illustrate, we present a topic model of Margaret A. Murray&#8217;s seminal 1921 work <a href="https://en.wikipedia.org/wiki/The_Witch-Cult_in_Western_Europe">&#8220;The Witch Cult in Western Europe&#8221;</a>. There are many uneasy subtleties in producing such a model, into which we will not plunge at this early stage; at a quick glance, however, we can see that from Murray&#8217;s detailed research and interweaved arguments for a modern-day survival of an ancient witch cult in Europe, the algorithm can identify certain prevalent themes. The third topic, for example, appears to conjure terms related to the conflict between the accepted state religion and the &#8216;heathen&#8217; witch cult. The ninth topic concerns itself with the <a href="https://en.wikipedia.org/wiki/Witches%27_mark">witches&#8217; marks</a>, supposedly identified on the body of practitioners; while the tenth dwells on the clandestine meetings and <a href="https://en.wikipedia.org/wiki/Witches%27_Sabbath">sabbaths</a> of the cult.</p>
<figure id="attachment_1206" aria-describedby="caption-attachment-1206" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png"><img fetchpriority="high" decoding="async" data-attachment-id="1206" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/wcwe_topic_plot_13-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Witch Cult in Western Europe Topic Plot" data-image-description="&lt;p&gt;Topic plot for Murray&amp;#8217;s &amp;#8220;The Witch Cult in Western Europe&amp;#8221;&lt;/p&gt;
" data-image-caption="&lt;p&gt;Topic plot for Murray&amp;#8217;s &amp;#8220;The Witch Cult in Western Europe&amp;#8221;&lt;/p&gt;
" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png" alt="Topic plot for Murray&#039;s &quot;The Witch Cult in Western Europe&quot;" width="1920" height="1080" class="size-full wp-image-1206" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-678x381.png 678w" sizes="(max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1206" class="wp-caption-text">Topic plot for Murray&#8217;s &#8220;The Witch Cult in Western Europe&#8221; | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Witch Cult Topic Model Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>wcwe_topics.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )<br />
library( widyr )</p>
<p>library( stm )<br />
library( quanteda )</p>
<p>library(cowplot)</p>
<p># For reorder_within() for facets: &lt;https://juliasilge.com/blog/reorder-within/&gt;<br />
library( drlib ) 		</p>
<p># Fonts<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Read (processed) text of Murray&#8217;s &quot;The Witch Cult in Western Europe&quot;.<br />
wcwe_raw &lt;-<br />
	read_csv( &quot;data/wcwe/wcwe_raw.csv&quot;, col_names=FALSE ) %&gt;%<br />
	rename( text = X1 ) %&gt;%<br />
	rowid_to_column( var = &quot;chapter&quot; )</p>
<p># Tokenize<br />
# (Remove words of 3 letters or less)<br />
# Stemming and stopword removal apparently not so effective anyway,<br />
# according to Schofield et al.: &lt;www.cs.cornell.edu/~xanda/winlp2017.pdf&gt;<br />
wcwe_words &lt;-<br />
	wcwe_raw %&gt;%<br />
	unnest_tokens( word, text ) %&gt;%<br />
	filter( !word %in% stop_words$word )  %&gt;%<br />
	filter( str_length( word ) &gt; 3 )</p>
<p>wcwe_word_counts &lt;-<br />
	wcwe_words %&gt;%<br />
	count( word, chapter, sort = TRUE ) </p>
<p># Generate the corpus<br />
wcwe_dfm &lt;-<br />
	wcwe_words %&gt;%<br />
	count( chapter, word, sort=TRUE ) %&gt;%<br />
	cast_dfm( chapter, word, n )</p>
<p># Search for a number of topics and output goodness-of-fit measures. </p>
<p># N=2 is the number of documents &#8216;held out&#8217; for the goodness-of-fit measure.<br />
# (The model is trained on the main body, then used to calculate the<br />
# likelihood of the held-out documents.) N=2 is used here to produce<br />
# approximately 10% of the corpus.</p>
<p>if( not( file.exists( &quot;work/wcwe_topic_search_k.rds&quot; ) ) ) {</p>
<p>	message( &quot;Searching low-n topic models&#8230;&quot; )</p>
<p>	wcwe_k &lt;-<br />
		searchK( wcwe_dfm, K=c(3:30), N=2 )</p>
<p>	saveRDS( wcwe_k, &quot;work/wcwe_topic_search_k.rds&quot; )</p>
<p>} else {</p>
<p>	wcwe_k &lt;-<br />
		readRDS( &quot;work/wcwe_topic_search_k.rds&quot; )</p>
<p>}</p>
<p># Plot semantic coherence against exclusivity for model selection<br />
wcwe_k_plot &lt;-<br />
	wcwe_k$results %&gt;%<br />
	gather( key=&quot;variable&quot;, value=&quot;value&quot;, exclus, semcoh )</p>
<p>wcwe_k_semcoh_exclusive &lt;-<br />
	ggplot( wcwe_k_plot, aes( x=K, y=value, group=variable) ) +<br />
	geom_line() +<br />
	facet_wrap( ~variable, ncol=1, scales=&quot;free_y&quot; )</p>
<p># Based on metrics of the above, calculate a 13-topic model<br />
if( not( file.exists( &quot;work/wcwe_topic_stm-13.rds&quot; ) ) ) {</p>
<p>	message( &quot;Calculating 13-topic model&#8230;&quot; )</p>
<p>	wcwe_topic_model_13 &lt;-<br />
		stm( wcwe_dfm, K=13, init.type=&quot;Spectral&quot; )</p>
<p>	# This takes a long time, so save output<br />
	saveRDS( wcwe_topic_model_13, &quot;work/wcwe_topic_stm-13.rds&quot; )</p>
<p>} else {</p>
<p>	wcwe_topic_model_13 &lt;- readRDS( &quot;work/wcwe_topic_stm-13.rds&quot; )</p>
<p>}</p>
<p># Work with the 13-topic model for now<br />
wcwe_topic_model &lt;- wcwe_topic_model_13</p>
<p>### Convert output to a tidy tibble<br />
wcwe_topic_model_tbl &lt;-<br />
	tidy(wcwe_topic_model, matrix = &quot;beta&quot; )</p>
<p>wcwe_topics_top &lt;-<br />
	wcwe_topic_model_tbl %&gt;%<br />
	group_by(topic) %&gt;%<br />
	top_n(10, beta) %&gt;%<br />
	ungroup() %&gt;%<br />
	arrange(topic, -beta)</p>
<p>gp &lt;-<br />
	wcwe_topics_top %&gt;%<br />
	mutate(term = reorder_within(term, beta, topic)) %&gt;%<br />
	ggplot(aes(term, beta, fill = factor(topic))) +<br />
	geom_col(show.legend = FALSE, alpha=0.8 ) +<br />
	facet_wrap(~ topic, scales = &quot;free&quot;) +<br />
	scale_x_reordered() +<br />
	coord_flip()</p>
<p># Palette of ink colours obtained from screenshots of Diamine inks.<br />
ink_colours &lt;- c( &quot;#753733&quot;, &quot;#b6091d&quot;, &quot;#e45025&quot;, &quot;#232d1d&quot;,<br />
					  	&quot;#224255&quot;, &quot;#533f50&quot;, &quot;#453437&quot;, &quot;#7f2430&quot;,<br />
						&quot;#254673&quot;, &quot;#52120e&quot;, &quot;#3d2535&quot;, &quot;#25464b&quot;,<br />
						&quot;#2f2a1c&quot; )</p>
<p>gp &lt;-<br />
	gp + scale_fill_manual( values=ink_colours )</p>
<p>topic_plot_13 &lt;-<br />
	gp + labs( x=&quot;Term&quot;, y=&quot;Probability in Topic&quot; ) +<br />
	theme (<br />
			 plot.title = element_text( family=&quot;bold_font&quot;, size=16 ),<br />
			 plot.subtitle = element_text( family=&quot;bold_font&quot;, size=12 ),<br />
			 axis.text.y = element_text( family=&quot;bold_font&quot;, size=10 )<br />
			 ) </p>
<p>theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p>wcwe_topic_plot &lt;-<br />
	topic_plot_13 +<br />
	theme (<br />
			 axis.title.y = element_text( angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
			 ) </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;\&quot;The Witch Cult in Western Europe\&quot; (Murray, 1921) Topic Model&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: Murray, M. \&quot;The Witch Cult in Western Europe\&quot; (1921) | http://www.gutenberg.org/ebooks/20411&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, wcwe_topic_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>wcwe_topic_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>ggsave( &quot;output/wcwe_topic_plot_13.pdf&quot;, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<p>As the plot above suggests, topic modelling is a tool to support our limited human understanding rather than a cold, mechanical source of objectivity and, as with much unsupervised machine learning, there are various subjective choices that must be made, guided by the intended purpose of the analysis. Drawing together imperceptible threads of relation in bodies of text, the approach suggests major themes and, crucially, can associate disparate areas of text that focus on similar concerns.</p>
<h1>Topical Remedies</h1>
<p>What, then, can we learn by bringing the oppressive weight of latent Dirichlet allocation to bear against a cryptic tome whose words, and indeed letters, resist our best efforts at interpretation?</p>
<p>Without understanding of individual words, we will be unable to glean the semantic interpretation of topics that was possible with Murray&#8217;s <em>Witch Cult&#8230;</em>. There is a chance, however, that the topic model can derive relations between separated sections of the manuscript &#8212; do certain early pages demonstrate a particular textual relationship to later pages? Do sections of the overall manuscript retain an apparent <em>coherence</em> of topics, with contiguous pages being drawn from a small range of similar topics? Which Voynich words fall under similar topics?</p>
<h1>Preparations</h1>
<p>Topic modelling typically requires text to undergo a certain level of formulaic preparation. The most common of such rituals are <a href="https://en.wikipedia.org/wiki/Stemming"><em>stemming</em></a>, <a href="https://en.wikipedia.org/wiki/Lemmatisation"><em>lemmatization</em></a>, and <a href="https://en.wikipedia.org/wiki/Stop_words"><em>stopword removal</em></a>. Briefly, stemming and lemmatization aim to reduce confusion by rendering words to their purest essence. Stemming is a more crude heuristic, unsympathetically incising endings, and so truncating <em>&#8220;dark&#8221;</em>, <em>&#8220;darker&#8221;</em>, <em>&#8220;darkest&#8221;</em> simply to the atomic root word <em>&#8220;dark&#8221;</em>. Lemmatization requires more understanding, untangling parts of speech and context: that <em>to curse</em> is a verb while <em>a curse</em> is a noun; the two identical words should therefore be treated separately.</p>
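<p>The difference in temperament between the two rituals can be sketched crudely in code. The toy suffix-stripper below is purely illustrative (a real analysis would invoke an established algorithm such as Porter's stemmer); it exists only to show stemming's blunt, rule-based character.</p>

```python
# A toy illustration of stemming as a blunt, suffix-stripping heuristic.
# Real analyses would use an established stemmer (e.g. Porter's); this
# sketch only mimics the behaviour described above.
def crude_stem(word: str) -> str:
    # Strip the first matching suffix, provided a plausible root remains.
    for suffix in ("est", "er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "dark", "darker", "darkest" all reduce to the atomic root "dark".
print([crude_stem(w) for w in ["dark", "darker", "darkest"]])
```

<p>Lemmatization, by contrast, cannot be captured by such suffix rules at all: distinguishing <em>to curse</em> from <em>a curse</em> requires part-of-speech context that no string operation can supply.</p>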
<p>Stopword removal aims to remove the overwhelming proportion of shorter, structural words that are ubiquitous throughout any text, but are largely irrelevant to the overall topic: <em>the, and, were, it, they, but, if&#8230;</em>. Whilst key to our understanding of texts, these terms have no significance to the theme or argument of a text.</p>
<p>Undermining our scheme to perform topic modelling, therefore, is the lamentable fact that, without understanding of either the text or its structure, we are largely unable to perform any of these tasks satisfactorily. We have neither an understanding of the grammatical form of Voynich words to allow stemming or lemmatization, nor a list of stopwords to excise.</p>
<p>Whilst stemming and lemmatization are unapproachable, at least within the confines of this post, we can effect a crude form of stopword removal through use of a common frequency analysis of the text. Stopwords are, in general, those words that are both most frequently occurring in some corpus of documents <em>and</em> found across the majority of documents in that language. The second criterion ensures that words occurring frequently in obscure and specialised texts are not considered of undue importance.</p>
<p>This overall statistic is known as <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf"><em>term frequency-inverse document frequency</em></a>, or <a href="https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html"><em>tf-idf</em></a>, and is widely used in information retrieval to identify terms of specific interest within certain documents that are not shared by the wider corpus. For our purposes, we wish to identify and elide those ubiquitous, frequent terms that occur across the entire corpus. To do so, given our lack of knowledge of the structure of the Voynich Manuscript, we will consider each folio as a separate document, and consider only the <em>inverse document frequency</em>, as we are uninterested in how common a word is within each document. To excise the words that appear most commonly across the manuscript, and guided by the sizes of stopword lists in a range of known languages, we therefore remove the 200 words with the lowest inverse document frequency scores<span id='easy-footnote-1-1200' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#easy-footnote-bottom-1-1200' title='The widely-used &lt;a href=&quot;https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip&quot;&gt;NLTK stopword corpus&lt;/a&gt; contains a list of stopwords for 23 world languages, with a notable bias towards European languages. The median length of these stopword lists is 201.5, with values ranging from 53 for Turkish to 1784 for Slovene.'><sup>1</sup></a></span>.</p>
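<p>The idf-only heuristic can be sketched as follows. The folios, words, and cutoff below are invented for illustration; the actual analysis, in R, treats each folio of the manuscript as a document and removes the 200 lowest-scoring words.</p>

```python
# Sketch of the idf-only stopword heuristic: treat each folio as a
# document and drop the words with the lowest inverse document frequency,
# i.e. those appearing in the most folios. (Toy, invented data.)
import math
from collections import Counter

folios = [
    ["daiin", "chedy", "qokeedy", "shedy"],
    ["daiin", "okaiin", "chedy", "qokain"],
    ["daiin", "chedy", "oteedy", "qokeedy"],
]

n_docs = len(folios)
doc_freq = Counter()
for folio in folios:
    doc_freq.update(set(folio))  # count folios containing each word

# idf(w) = ln(N / df(w)); words present in every folio score exactly zero.
idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}

# Remove the k lowest-idf words (the post removes 200; k=2 for this toy data).
k = 2
stopwords = set(sorted(idf, key=idf.get)[:k])
print(sorted(stopwords))
```

<p>Here <code>daiin</code> and <code>chedy</code>, present on every toy folio, score an idf of zero and are excised; the same principle scales to the full manuscript.</p>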
<p>Having contorted the text into an appropriate form for analysis, we can begin the process of discerning its inner secrets. Our code relies on the <a href="https://www.tidytextmining.com/"><code>tidytext</code></a> and <a href="https://cran.r-project.org/package=stm"><code>stm</code></a> packages, allowing for easy manipulation of document structure and topic models<span id='easy-footnote-2-1200' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#easy-footnote-bottom-2-1200' title='We have also relied extensively on the superb work and writing of &lt;a href=&quot;https://juliasilge.com/blog/sherlock-holmes-stm/&quot;&gt;Silge&lt;/a&gt; on this topic.'><sup>2</sup></a></span>.</p>
<h1>Numerous Interpretations</h1>
<p>Topic models are a cautionary example of recklessly <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised machine learning</a>. As with most such approaches, there are a number of subjective choices to be made that affect the outcome. Perhaps the most influential is the selection of the <em>number</em> of topics that the model should generate. Whilst some approaches have been suggested to derive this number purely by analysis, in most cases it remains in the domain of the human supplicant. Typically, the number of topics is guided both by the structure of the text and by whatever arcane purpose the analysis might have. With our imposed lack of understanding, however, we must rely solely on crude metrics to make this most crucial of choices.</p>
<p>Several methods of assessment exist to quantify the fit of a topic model to the text. The two that we will employ, guided by the <code>stm</code> package, are <a href="https://rdrr.io/cran/stm/man/semanticCoherence.html"><em>semantic coherence</em></a>, which roughly expresses that words from a given topic should co-occur within a document; and <a href="https://rdrr.io/cran/stm/man/exclusivity.html"><em>exclusivity</em></a>, which values models more highly when given words occur within topics with high frequency, but are also relatively exclusive to those topics.</p>
<p>We select an optimal number of topics by the simple process of calculating models with varying numbers of topics, and assessing when these two scores are maximised. For the Voynich Manuscript we observe that 34 topics appears to be initially optimal<span id='easy-footnote-3-1200' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#easy-footnote-bottom-3-1200' title='It should be noted that topic modelling is more typically applied to much larger corpora of text than is possible with our restriction to the Voynich Manuscript. Given the relatively short nature of the text, we might prefer to focus on a smaller number of topics. The metrics plot shows a spike in semantic coherence around 12 topics that might be of interest in future analyses.'><sup>3</sup></a></span>.</p>
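<p>The selection loop itself is simple to sketch. The Python analogue below scans candidate topic counts and scores each model on held-out documents using scikit-learn's perplexity, standing in for the semantic coherence and exclusivity metrics computed by <code>searchK</code> in the R code; the corpus and range of candidate counts are invented for illustration.</p>

```python
# Hedged sketch of topic-number selection: fit models over a range of K
# and score each on held-out documents. Held-out perplexity (lower is
# better) stands in here for stm's coherence/exclusivity metrics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "root leaf flower stem herb water",
    "star moon circle ring chart sky",
    "bath pool pipe figure water vessel",
    "herb root remedy jar leaf mixture",
    "moon star zodiac ring sky chart",
    "jar vessel remedy mixture pipe bath",
]
# Hold out ~a third of the documents for scoring, as searchK holds out
# roughly 10% of the (much larger) real corpus.
train, held_out = docs[:4], docs[4:]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train)
X_test = vectorizer.transform(held_out)

scores = {}
for k in range(2, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    scores[k] = lda.perplexity(X_test)  # likelihood-based fit measure

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

<p>With the real manuscript, the analogous scan over K=3..30 (and beyond) yields the metrics plotted below, from which the 34-topic model is selected.</p>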
<figure id="attachment_1211" aria-describedby="caption-attachment-1211" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png"><img decoding="async" data-attachment-id="1211" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/voynich_topic_selection_metrics-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_topic_selection_metrics" data-image-description="&lt;p&gt;Selection metrics for Voynich topic model topic numbers.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Selection metrics for Voynich topic model topic numbers.&lt;/p&gt;
" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png" alt="Voynich Topic Model Selection Metrics" width="1920" height="1080" class="size-full wp-image-1211" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-678x381.png 678w" sizes="(max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1211" class="wp-caption-text">Selection metrics for Voynich topic model topic numbers. | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.pdf">PDF Version</a>)</figcaption></figure>
<p>The initial preparation of the data, the search through topic models with varying numbers of topics, and the selection of the final 34-topic model are given in the code below, alongside plotting code for the metrics diagram.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript Topic Modelling Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-model.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( tidytext )<br />
library( widyr )</p>
<p>library( stm )</p>
<p># install_github(&quot;dgrtwo/drlib&quot;)<br />
library( drlib )</p>
<p># References:<br />
# &lt;http://varianceexplained.org/r/op-ed-text-analysis/&gt;<br />
# &lt;https://cbail.github.io/SICSS_Topic_Modeling.html#working-with-meta-data&gt;<br />
# &lt;https://me.eui.eu/andrea-de-angelis/blog/structural-topic-models-to-study-political-text-an-application-to-the-five-star-movements-blog/&gt;<br />
# &lt;https://scholar.princeton.edu/sites/default/files/bstewart/files/stm.pdf&gt;<br />
# &lt;https://juliasilge.com/blog/evaluating-stm/&gt;</p>
<p>voynich_raw &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Read in manually-identified sections per folio, according to<br />
# &lt;http://www.voynich.nu/descr.html#illustr&gt;<br />
voynich_sections &lt;-<br />
	read_csv( &quot;data/voynich_sections.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, section = X2 )</p>
<p># Merge the above to note section for each folio alongside the text<br />
voynich_tbl &lt;-<br />
	left_join( voynich_sections, voynich_raw )</p>
<p># Tokenize<br />
# Remove words of 3 letters or fewer.<br />
voynich_words &lt;-<br />
	voynich_tbl %&gt;%<br />
	unnest_tokens( word, text ) %&gt;%<br />
	filter( str_length( word ) &gt; 3 )</p>
<p># Most common words<br />
voynich_common &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, sort=TRUE ) %&gt;%<br />
	mutate( word = reorder( word, n ) )</p>
<p># Counts of words per folio<br />
voynich_word_counts &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, folio, sort = TRUE ) </p>
<p># TF-IDF<br />
voynich_tf_idf &lt;-<br />
	voynich_word_counts %&gt;%<br />
	bind_tf_idf( word, folio, n ) %&gt;%<br />
	arrange( desc( tf_idf ) )</p>
<p># Based on median stopword count of languages in NLTK<br />
# (&lt;https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip&gt;),<br />
# remove the 200 lowest-scoring words.<br />
voynich_stopwords &lt;-<br />
	voynich_tf_idf %&gt;%<br />
	arrange( idf  ) %&gt;%<br />
	select( word ) %&gt;%<br />
	unique() %&gt;%<br />
	head( 200 ) %&gt;%<br />
	extract2( &quot;word&quot; )</p>
<p>voynich_words &lt;-<br />
	voynich_words %&gt;%<br />
	filter( !word %in% voynich_stopwords  ) </p>
<p># Generate the corpus<br />
voynich_dfm &lt;-<br />
	voynich_words %&gt;%<br />
	count( folio, word, sort=TRUE ) %&gt;%<br />
	cast_dfm( folio, word, n )</p>
<p># Search for a number of topics and output goodness-of-fit measures. </p>
<p># N=20 is the number of documents &#8216;held out&#8217; for the goodness-of-fit measure.<br />
# (The model is trained on the main body, then used to calculate the<br />
# likelihood of the held-out documents.) N=20 is used here to hold out<br />
# approximately 10% of the corpus.<br />
if( not( file.exists( &quot;work/voynich_topic_search_k.rds&quot; ) ) ) {</p>
<p>	message( &quot;Searching low-n topic models&#8230;&quot; )</p>
<p>	voynich_k &lt;-<br />
		searchK( voynich_dfm, K=c(3:40), N=20 )</p>
<p>	saveRDS( voynich_k, &quot;work/voynich_topic_search_k.rds&quot; )</p>
<p>} else {</p>
<p>	voynich_k &lt;-<br />
		readRDS( &quot;work/voynich_topic_search_k.rds&quot; )</p>
<p>}</p>
<p># Based on the metrics above, use 34-topic model<br />
if( not( file.exists( &quot;work/voynich_topic_stm-34.rds&quot; ) ) ) {</p>
<p>	message( &quot;Calculating 34-topic model&#8230;&quot; )</p>
<p>	# (Setting K=0 would instead use Lee and Mimno (2014) to choose the number of topics.)<br />
	voynich_topic_model_34 &lt;-<br />
		stm( voynich_dfm, K=34, init.type=&quot;Spectral&quot; )</p>
<p>	# This takes a long time, so save output<br />
	saveRDS( voynich_topic_model_34, &quot;work/voynich_topic_stm-34.rds&quot; )</p>
<p>} else {</p>
<p>	voynich_topic_model_34 &lt;- readRDS( &quot;work/voynich_topic_stm-34.rds&quot; )</p>
<p>}</p>
<p># Based on the metrics above, also calculate a secondary 12-topic model<br />
if( not( file.exists( &quot;work/voynich_topic_stm-12.rds&quot; ) ) ) {</p>
<p>	message( &quot;Calculating 12-topic model&#8230;&quot; )</p>
<p>	# Setting K=0 uses (Lee and Minno, 2014) to select a number of topics<br />
	voynich_topic_model_12 &lt;-<br />
		stm( voynich_dfm, K=12, init.type=&quot;Spectral&quot; )</p>
<p>	# This takes a long time, so save output<br />
	saveRDS( voynich_topic_model_12, &quot;work/voynich_topic_stm-12.rds&quot; )</p>
<p>} else {</p>
<p>	voynich_topic_model_12 &lt;- readRDS( &quot;work/voynich_topic_stm-12.rds&quot; )</p>
<p>}</p>
<p># Work initially with the 34-topic model<br />
voynich_topic_model &lt;- voynich_topic_model_34</p>
<p>## Convert output to a tidy tibble<br />
voynich_topic_model_tbl &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;beta&quot; )</p>
<p>voynich_terms &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;gamma&quot; )</p>
<p># Select the top six terms in each topic for display<br />
voynich_topics_top &lt;-<br />
	voynich_topic_model_tbl %&gt;%<br />
	group_by(topic) %&gt;%<br />
	top_n(6, beta) %&gt;%<br />
	ungroup() %&gt;%<br />
	arrange(topic, -beta)</p>
<p># Produce a per-folio topic identification.<br />
# Document &#8211; topic &#8211; score<br />
topic_identity &lt;-<br />
	voynich_terms %&gt;%<br />
	group_by( document ) %&gt;%<br />
	top_n( 1, gamma ) %&gt;%<br />
	arrange( document ) %&gt;%<br />
	ungroup</p>
<p># Reinsert manually-identified section information (derived from<br />
# illustrations).<br />
topic_identity$section &lt;- voynich_tbl$section </p>
<p>saveRDS( topic_identity, &quot;work/topic_identity.rds&quot; )</p>
[/code]
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Topic Metric Selection Plotting Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-plot_metric.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( ggplot2 )<br />
library( cowplot )</p>
<p># References:<br />
# &lt;http://varianceexplained.org/r/op-ed-text-analysis/&gt;<br />
# &lt;https://cbail.github.io/SICSS_Topic_Modeling.html#working-with-meta-data&gt;<br />
# &lt;https://me.eui.eu/andrea-de-angelis/blog/structural-topic-models-to-study-political-text-an-application-to-the-five-star-movements-blog/&gt;<br />
# &lt;https://scholar.princeton.edu/sites/default/files/bstewart/files/stm.pdf&gt;<br />
# &lt;https://juliasilge.com/blog/evaluating-stm/&gt;</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Read topic model search<br />
voynich_k &lt;-<br />
	readRDS( &quot;work/voynich_topic_search_k.rds&quot; )</p>
<p># Plot semantic coherence against exclusivity for model selection<br />
voynich_k_plot &lt;-<br />
	voynich_k$results %&gt;%<br />
	as_tibble %&gt;%<br />
	rename( &quot;Semantic Coherence&quot;=semcoh, &quot;Exclusivity&quot;=exclus ) %&gt;%<br />
	gather( key=&quot;variable&quot;, value=&quot;value&quot;, &quot;Semantic Coherence&quot;, &quot;Exclusivity&quot; ) %&gt;%<br />
	rename( &quot;Topic Count&quot;=K, &quot;Value&quot;=value )</p>
<p># We will use cowplot, so set the theme here.<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Plot semantic coherence against exclusivity.<br />
# (Highlight the selected 34-topic point.)<br />
voynich_k_semcoh_exclusive &lt;-<br />
	ggplot( voynich_k_plot, aes( x=`Topic Count`, y=Value, group=variable) ) +<br />
	geom_line( colour=&quot;#8a0707&quot; ) +<br />
	facet_wrap( ~variable, ncol=1, scales=&quot;free_y&quot; ) +<br />
	geom_vline( xintercept=34, colour=&quot;#228b22&quot;, linetype=&quot;longdash&quot; ) +<br />
	scale_x_continuous(breaks=c( seq(0, 40, 10), 34 ) ) +<br />
	theme (<br />
			 axis.title.y = element_text( angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
	)  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Topic Selection Metrics&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: http://www.voynich.nu&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, voynich_k_semcoh_exclusive, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_topic_selection_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>ggsave( &quot;output/voynich_topic_selection_metrics.pdf&quot;, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<p>With these tortuous steps finally trodden, our path leads at last to a model of the underlying word-generating probabilities of the Voynich Manuscript. In each facet, the highest-probability words in each topic are shown in order.</p>
<figure id="attachment_1215" aria-describedby="caption-attachment-1215" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png"><img loading="lazy" decoding="async" data-attachment-id="1215" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/voynich_topic_plot_34-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Manuscript Topic Model" data-image-description="&lt;p&gt;Voynich Manuscript Topic Model (34 topics)&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript Topic Model (34 topics)&lt;/p&gt;
" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png" alt="Voynich Manuscript Topic Model" width="1920" height="1080" class="size-full wp-image-1215" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1215" class="wp-caption-text">Voynich Manuscript Topic Model (34 topics) | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Topic Model Plotting Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-plot_topics.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )</p>
<p># install_github(&quot;dgrtwo/drlib&quot;)<br />
library( drlib )</p>
<p>library( ggplot2 )<br />
library( cowplot )</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># ####################<br />
# ## 34 Topic Model ##<br />
# ####################</p>
<p># Work with the 34-topic model<br />
voynich_topic_model &lt;-<br />
	readRDS( &quot;work/voynich_topic_stm-34.rds&quot; )</p>
<p>## Convert output to a tidy tibble<br />
voynich_topic_model_tbl &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;beta&quot; )</p>
<p>voynich_terms &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;gamma&quot; )</p>
<p># Select the top six terms in each topic for display<br />
voynich_topics_top &lt;-<br />
	voynich_topic_model_tbl %&gt;%<br />
	group_by(topic) %&gt;%<br />
	top_n(6, beta) %&gt;%<br />
	ungroup() %&gt;%<br />
	arrange(topic, -beta)</p>
<p># We will use cowplot, so set the theme here.<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Plot each topic as a geom_col(). Use drlib&#8217;s &#8216;reorder_within&#8217; to order bars<br />
# within each facet. (Note that the scale_x_reordered() is needed to fix<br />
# (flipped!) x-axis labels in the output.)<br />
# &lt;https://juliasilge.com/blog/reorder-within/&gt;<br />
gp &lt;-<br />
	voynich_topics_top %&gt;%<br />
	mutate(term = reorder_within(term, beta, topic)) %&gt;%<br />
	ggplot(aes(term, beta, fill = factor(topic))) +<br />
	geom_col( alpha=0.8, show.legend = FALSE) +<br />
	theme( axis.text.y = element_text( family=&quot;voynich_font&quot;, size=10 ) ) +<br />
	facet_wrap(~ topic, scales = &quot;free&quot;) +<br />
	scale_x_reordered() +<br />
	coord_flip() +<br />
	labs( x=&quot;Term&quot;, y=&quot;Probability in Topic&quot; )</p>
<p># Theming<br />
gp &lt;-<br />
	gp +<br />
	theme (<br />
			 axis.title.y = element_text( margin = margin(t = 0, r = 12, b = 0, l = 0), angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( margin = margin(t = 12, r = 0, b = 0, l = 0), colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=6 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
			 ) </p>
<p>gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			legend.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;)<br />
	)</p>
<p># Palette of ink colours (based on screenshots of Diamine inks).<br />
ink_colours &lt;- c( &quot;#753733&quot;, &quot;#b6091d&quot;, &quot;#e45025&quot;, &quot;#232d1d&quot;,<br />
					  	&quot;#224255&quot;, &quot;#533f50&quot;, &quot;#453437&quot;, &quot;#7f2430&quot;,<br />
						&quot;#254673&quot;, &quot;#52120e&quot;, &quot;#3d2535&quot;, &quot;#25464b&quot;,<br />
						&quot;#2f2a1c&quot; )</p>
<p># Create a vector of selections from the palette, one for each topic.<br />
ink_palette &lt;-<br />
	sample( ink_colours, size=34, replace=TRUE )</p>
<p># Add fill colours to plot.<br />
gp &lt;-<br />
	gp + scale_fill_manual( values=ink_palette )</p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Topic Model (34 Topics)&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: http://www.voynich.nu&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, gp, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_topic_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>ggsave( &quot;output/voynich_topic_plot_34.pdf&quot;, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<h1>Of Man and Machine</h1>
<p>The topic model produces a set of topics in the form of probability distributions over words. The association of each topic with a folio in the Voynich Manuscript is a probabilistic assignment based solely on the distribution of words in the text. There is, however, a secondary topic identification, tentatively proposed by scholars of the manuscript: the obscure diagrams decorating almost every folio provide their own startling implications as to the themes detailed in the undeciphered prose.</p>
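<p>The shape of the model output can be sketched with a minimal example. The fitted model yields two matrices: a topic-word probability matrix (&#8220;beta&#8221;) and a per-folio topic proportion matrix (&#8220;gamma&#8221;); the per-folio topic identification used above is simply the highest-proportion topic in each row of gamma. The sketch below uses Python and invented numbers purely for illustration; the real matrices come from the fitted STM.</p>

```python
import numpy as np

# Hypothetical gamma matrix: one row per folio, one column per topic,
# each row summing to 1. (Invented numbers, for illustration only; the
# real matrix comes from tidy(voynich_topic_model, matrix = "gamma").)
gamma = np.array([
    [0.70, 0.20, 0.10],  # folio 1: dominated by topic 1
    [0.15, 0.25, 0.60],  # folio 2: dominated by topic 3
    [0.40, 0.45, 0.15],  # folio 3: dominated by topic 2
])

# Per-folio topic identification: the highest-proportion topic per row
# (the equivalent of the top_n( 1, gamma ) step in the R code above).
topic_identity = gamma.argmax(axis=1) + 1  # 1-based topic numbers

print(topic_identity.tolist())  # [1, 3, 2]
```

<p>The beta matrix is read analogously: the top six entries per row give the per-topic term lists plotted above.</p>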
<p>We might wish to ask, then: do the topic assignments generated by the machine reflect the human interpretation? To what extent do pages decorated with herbal illuminations follow certain machine-identified topics, compared with those assigned to astronomical charts?</p>
<p>The illustration-based thematic sections of the Voynich Manuscript fall into eight broad categories, according to <a href="http://www.voynich.nu/descr.html#illustr">Zandbergen</a>. These sections are, briefly:</p>
<ul>
<li>herbal, detailing a range of unidentified plants, comprising most of the first half of the manuscript;</li>
<li>astronomical, focusing on stars, planets, and astronomical symbols;</li>
<li>cosmological, displaying obscure circular diagrams of a similar form to the astronomical;</li>
<li>astrological, in which small humans are displayed mostly in circular diagrams alongside zodiac signs;</li>
<li>biological, characterised by small drawings of human figures, often connected by tubes;</li>
<li>pharmaceutical, detailing parts of plants and vessels for their preparation;</li>
<li>starred text, divided into short paragraphs marked with a star, with no other illustrations; and</li>
<li>text-only pages.</li>
</ul>
<p>With these contextual descriptions, we can examine the relationship between the speculative assignments of the topic model and the suggestions of the diagrams.</p>
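<p>The relationship can also be quantified directly: cross-tabulating each folio&#8217;s machine-assigned topic against its illustration-based section reveals whether a section concentrates on a few topics. A minimal Python sketch, using invented folio data purely for illustration (the real pairs come from the topic identity table saved earlier):</p>

```python
from collections import Counter

# Hypothetical (section, assigned topic) pairs, one per folio.
# (Invented for illustration; not the real Voynich assignments.)
folios = [
    ("herbal", 3), ("herbal", 3), ("herbal", 7),
    ("astronomical", 7), ("astronomical", 7),
    ("biological", 3),
]

# Cross-tabulate sections against topics.
crosstab = Counter(folios)

# For each section, find its most common topic: if sections and topics
# agree, each section should concentrate on only a few topics.
sections = {s for s, _ in folios}
dominant = {
    s: max((t for sec, t in folios if sec == s),
           key=lambda t: crosstab[(s, t)])
    for s in sections
}
print(dominant)  # {'herbal': 3, 'astronomical': 7, 'biological': 3}
```

<p>A heatmap such as the one below is, in effect, a visual rendering of this cross-tabulation over all folios.</p>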
<figure id="attachment_1219" aria-describedby="caption-attachment-1219" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png"><img loading="lazy" decoding="async" data-attachment-id="1219" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/voynich_folio_topic_heatmap-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Manuscript Folio Topic Heatmap" data-image-description="&lt;p&gt;Voynich Manuscript Folio Topic Heatmap&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript Folio Topic Heatmap&lt;/p&gt;
" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png" alt="Voynich Manuscript Folio Topic Heatmap" width="1920" height="1080" class="size-full wp-image-1219" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1219" class="wp-caption-text">Voynich Manuscript Folio Topic Heatmap | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript Topic Model Folio Heatmap Plotting Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-plot_heatmap.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p># install_github(&quot;dgrtwo/drlib&quot;)<br />
library( drlib )</p>
<p>library( ggplot2 )<br />
library( cowplot )</p>
<p>font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Set the number of topics<br />
n_topics &lt;- 34</p>
<p># Load the appropriate topic model<br />
voynich_topic_model &lt;-<br />
	readRDS( paste0( &quot;work/voynich_topic_stm-&quot;, n_topics, &quot;.rds&quot; ))</p>
<p>theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Load folio topic identity assignments<br />
topic_identity &lt;-<br />
	readRDS( &quot;work/topic_identity.rds&quot; )</p>
<p># Plot topic heatmap<br />
topic_heatmap &lt;-<br />
	topic_identity %&gt;%<br />
	ggplot( aes( x=document, y=topic, fill=section ) ) +<br />
	geom_tile( colour=&quot;#3c3f4a&quot;, alpha=0.8, size=0.4 ) +<br />
	scale_fill_brewer( palette=&quot;Dark2&quot;, direction=1, name=&quot;Section&quot;, labels=c(&quot;Astrological&quot;, &quot;Astronomical&quot;, &quot;Biological&quot;, &quot;Cosmological&quot;, &quot;Herbal&quot;, &quot;Pharmaceutical&quot;, &quot;Starred Text&quot;, &quot;Text Only&quot; ) ) +<br />
	ggtitle( &quot;Voynich Folio Topic Assignments&quot;, paste( n_topics, &quot;Topic Model&quot; )) +<br />
	labs( x=&quot;Folio&quot;, y=&quot;Topic&quot; ) +<br />
	theme (<br />
			 plot.title = element_text( family=&quot;bold_font&quot;, size=22 ),<br />
			 plot.subtitle = element_text( family=&quot;bold_font&quot;, size=12 ),<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 ) +<br />
	scale_y_continuous(labels = seq( 1, n_topics, 1 ), breaks = seq( 1, n_topics, 1 ), minor_breaks = seq(0.5 , n_topics+.5, 1) ) +<br />
	scale_x_continuous(minor_breaks = seq(0.5 , 226.5, 5) ) </p>
<p>gp &lt;-<br />
	topic_heatmap +<br />
	theme (<br />
			 axis.title.y = element_text( margin = margin(t = 0, r = 12, b = 0, l = 0), angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title =element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
			 ) </p>
<p>gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			legend.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;)<br />
	)</p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Topic Heatmap&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: http://www.voynich.nu&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, gp, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_topic_heatmap &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>ggsave( &quot;output/voynich_folio_topic_heatmap.pdf&quot;, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<p>The colours in the above plot represent the manual human interpretation, whilst the position on the y-axis shows the topic assigned by the structural topic model.</p>
<p>We might have harboured the fragile hope that such a diagram would demonstrate a clear delineation between the machine-identified topics and the illustration-based sections of the Voynich Manuscript. At first inspection, however, the topics identified by the analysis appear almost uniformly distributed across the pages of the manuscript.</p>
<p>The topic model rests on a number of assumptions, from the selection of stopwords through to the number of topics in the model. We must also be cautious: the apparent distribution of topics over the various sections may be deceptive. For the moment, we can present this initial topic model as a faltering first step in our descent into the hidden structures of the Voynich Manuscript. The next, and final, post in this series will develop both the statistical features and the topic model towards a firmer understanding of whether the apparent shift in theme suggested by the illustrations is statistically supported by the text.</p>
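<p>Whether the association between sections and topics exceeds chance can, in principle, be checked with a permutation test: shuffle the section labels repeatedly and compare an observed agreement statistic against its shuffled distribution. The sketch below uses Python, invented data, and a simple &#8220;purity&#8221; statistic chosen purely for illustration; it shows only the shape of such a test, not the analysis of the forthcoming post.</p>

```python
import random
from collections import Counter

random.seed(1)

# Invented per-folio data: section labels and machine-assigned topics.
sections = ["herbal"] * 6 + ["astronomical"] * 4
topics = [1, 1, 1, 2, 1, 1, 2, 2, 2, 1]

def purity(sections, topics):
    """Fraction of folios falling in their section's most common topic."""
    pairs = Counter(zip(sections, topics))
    best = Counter()
    for (s, _t), n in pairs.items():
        best[s] = max(best[s], n)
    return sum(best.values()) / len(topics)

observed = purity(sections, topics)  # 0.8 for this invented data

# Null distribution: purity under random shuffles of the section labels.
null = []
for _ in range(1000):
    shuffled = sections[:]
    random.shuffle(shuffled)
    null.append(purity(shuffled, topics))

# One-sided permutation p-value: how often chance matches or exceeds
# the observed purity.
p_value = sum(n >= observed for n in null) / len(null)
print(observed, p_value)
```

<p>A small p-value would suggest the section/topic association is stronger than chance alone would produce; a large one would counsel the caution urged above.</p>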
<p>Until then, read deeply but do not trust what you read.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1200</post-id>	</item>
		<item>
		<title>Illuminating the Illuminated Part One: A First Look at the Voynich Manuscript</title>
		<link>https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/</link>
					<comments>https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Thu, 26 Sep 2019 20:15:16 +0000</pubDate>
				<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[cryptology]]></category>
		<category><![CDATA[linguistics]]></category>
		<guid isPermaLink="false">http://www.weirddatascience.net/?p=883</guid>

					<description><![CDATA[<div class="mh-excerpt">While the world abounds with strange phenomena ripe for analysis in their raw state, there is a peculiar pleasure in scrutinising arcane information curated and obscured by the human mind.

The Voynich Manuscript is one of the most well-known and studied volumes of occult knowledge. The book's most recent history involves its purchase in 1912 by Wilfrid Voynich, a rare book dealer, from a sale of manuscripts by the Society of Jesus at the Villa Mondragone, Frascati.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/" title="Illuminating the Illuminated Part One: A First Look at the Voynich Manuscript">[...]</a>]]></description>
										<content:encoded><![CDATA[<h1>The Voynich Manuscript</h1>
<p>While the world abounds with strange phenomena ripe for analysis in their raw state, there is a peculiar pleasure in scrutinising arcane information curated and obscured by the human mind.</p>
<p>The Voynich Manuscript is one of the most well-known and studied volumes of occult knowledge. The book&#8217;s most recent history involves its purchase in 1912 by Wilfrid Voynich, a rare book dealer, from a sale of manuscripts by the Society of Jesus at the Villa Mondragone, Frascati. Following several fruitless years of attempts to decipher the manuscript and discover its origin, or to interest others in it, Wilfrid Voynich died. The book passed through a number of other hands before being donated to Yale University by the noted rare book dealer Hans P. Kraus in 1969. It now resides in Yale&#8217;s <a href="https://beinecke.library.yale.edu/">Beinecke Rare Book and Manuscript Library</a> with the designation <a href="https://brbl-dl.library.yale.edu/vufind/Record/3519597">MS 408</a>.</p>
<p>Written almost entirely in an unknown script, barring a small number of words apparently in Latin and High German, the manuscript is compellingly illustrated with depictions of plants, herbs, human figures, and astronomical and astrological symbols. The manuscript has resisted all attempts at interpretation by cryptographers, historians, and linguists.</p>
<figure id="attachment_910" aria-describedby="caption-attachment-910" style="width: 7486px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg"><img loading="lazy" decoding="async" data-attachment-id="910" data-permalink="https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/voynich_folio_178/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" data-orig-size="7486,3715" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_folio_178" data-image-description="&lt;p&gt;Voynich Manuscript &amp;#8211; Folio 178&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript &amp;#8211; Folio 178&lt;/p&gt;
" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" src="http://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" alt="Voynich Manuscript Folio 178" width="7486" height="3715" class="size-full wp-image-910" /></a><figcaption id="caption-attachment-910" class="wp-caption-text">Voynich Manuscript &#8211; Folio 178</figcaption></figure>
<p>From a linguistic and cryptographic perspective, this lack of success in interpretation is not surprising. The two hundred or so folios of the manuscript, while beautifully illuminated, present a sadly limited corpus of text for the purposes of traditional analysis.</p>
<p>In this short series of posts we will subject the Voynich Manuscript to a range of text analysis techniques, delving into its structure, gaining horrific insight into its composition, and skeptically assessing its credibility. The manuscript has been subjected to almost fifty years of furtive attempts by cryptographers, including the <a href="https://apps.dtic.mil/dtic/tr/fulltext/u2/a070618.pdf">US National Security Agency</a> and a menagerie of others from the distinguished to the deranged. We will crudely mimic some earlier results, and hopefully add our own confusion to the roiling mass of current research into the Voynich Manuscript.</p>
<h1>Authenticity</h1>
<p>Since its discovery, and throughout the ongoing unsuccessful attempts to decipher its contents, many have questioned the authenticity of the Voynich Manuscript. The theory that the entire book is a hoax, either by contemporary scribes or by more modern players, has been raised repeatedly over the years.</p>
<p>Radiocarbon dating in 2010 <a href="https://uanews.arizona.edu/story/ua-experts-determine-age-of-book-nobody-can-read">asserted that the manuscript&#8217;s parchment likely dates from the early 15th century</a>; the volume of parchment in the manuscript, and its consistency across the document, make it unlikely, although not impossible, that the book is a modern-day hoax.</p>
<p>Other supporting evidence has drawn from early mentions of the manuscript in correspondence. According to <a href="http://www.voynich.nu/index.html">http://www.voynich.nu</a>, which presents a far more detailed and thorough description of the research around the manuscript and its history than we could hope to offer here, the first extant mention of the manuscript can be found in a 1639 <a href="http://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_letter_39a.jpg" title="1639 letter from Athanasius Kircher in Rome, replying to a letter forwarded from Georgius Barschius of Prague by the mathematician Theodor Moretus.">letter</a> from <a href="https://en.wikipedia.org/wiki/Athanasius_Kircher">Athanasius Kircher</a> in Rome, replying to a letter forwarded from <a href="https://en.wikipedia.org/wiki/Georg_Baresch">Georgius Barschius</a> of Prague by the mathematician <a href="https://en.wikipedia.org/wiki/Theodorus_Moretus">Theodor Moretus</a>.</p>
<p>The letter refers to a <a href="http://www.voynich.nu/letters.html">&#8220;book of mysterious steganography&#8221;</a> (<em>&#8220;libellum&#8230; &#8230;steganographici mysterisi&#8221;</em>) illustrated with pictures of plants, stars and chemical secrets that Kircher had not yet had time to decipher. Barschius had sought out Kircher&#8217;s expertise due to his fame at the time for having claimed, erroneously as it later transpired, to have deciphered the hieroglyphic writing system of the Ancient Egyptian language. Later correspondence between Barschius and Kircher appears, according to Zandbergen<span id='easy-footnote-4-883' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#easy-footnote-bottom-4-883' title='&lt;a href=&quot;http://www.voynich.nu/letters.html&quot;&gt;http://www.voynich.nu/letters.html&lt;/a&gt;'><sup>4</sup></a></span>, to suggest strongly that the mysterious book in question, based on its description, is the Voynich Manuscript.</p>
<h1>A Statistical Argument: Zipf&#8217;s Law</h1>
<p>We now turn from historical sources to darker, more statistical realms. There is compelling support for the notion that, regardless of the true meaning of the book, its contents are drawn from a human language and are neither random symbols nor any form of sophisticated cipher.</p>
<p>One of the pillars of this argument is that certain statistical properties of the Voynich Manuscript&#8217;s text strongly resemble those of natural, human languages, properties which are unlikely, although not impossible, to arise from random text, artificially generated text, or most forms of encipherment.</p>
<p>The most well-known of these statistical properties is the apparent adherence of the manuscript to <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf&#8217;s Law</a>. This law, <a href="https://mitpress.mit.edu/books/psycho-biology-language">made famous by</a> the US linguist <a href="https://en.wikipedia.org/wiki/George_Kingsley_Zipf">George Zipf</a>, observes that in corpora of natural languages, the frequency of a word is inversely proportional to its rank when the words of the corpus are ordered by frequency. More plainly: the most common word in a corpus tends to be roughly twice as frequent as the second most common word, three times as frequent as the third most common, and so on. Whilst merely an approximation, this law can be seen to hold for most human languages, and for a range of other natural phenomena.</p>
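<p>The inverse-rank relationship can be made concrete with a small illustrative sketch in Python. This is not part of the original analysis: the corpus below is an idealised toy construction, and the tokens are arbitrary labels borrowed from common Voynich transliteration words.</p>

```python
from collections import Counter

# Build an idealised Zipfian corpus: the word of rank r appears 120 // r times.
# The tokens themselves are arbitrary labels, not real frequency data.
words = []
for rank, token in enumerate(["daiin", "ol", "chedy", "aiin", "shedy"], start=1):
    words.extend([token] * (120 // rank))

# Rank words by frequency: under Zipf's Law, frequency * rank is roughly constant.
ranked = Counter(words).most_common()
for rank, (token, freq) in enumerate(ranked, start=1):
    print(f"rank {rank}: {token!r} appears {freq} times; rank * freq = {rank * freq}")
```

<p>For this idealised corpus the product of rank and frequency is exactly 120 at every rank; a real corpus, the Voynich transliteration included, only approximates the relationship.</p>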
<p>Random gibberish, on the other hand, would most likely not follow Zipf&#8217;s Law, although carefully crafted gibberish certainly could. <a href="https://www.tandfonline.com/doi/abs/10.1080/0161-110491892755">Rugg</a> has demonstrated that a simple mechanical procedure can produce randomised text that adheres to Zipf&#8217;s Law, although the example he provides is somewhat contrived and presupposes a knowledge of this statistical quirk of human languages in the first place. Given that the physical makeup of the Voynich Manuscript dates to the early 15th century, some four centuries before Zipf popularised this mathematical assessment of human languages, the argument that it is a contemporary act of calligraphic glossolalia seems strained.</p>
<p>Similarly, most forms of cryptography beyond the simplest <a href="https://en.wikipedia.org/wiki/Substitution_cipher">substitution ciphers</a> would also skew the text away from Zipf&#8217;s Law. It is notable that the Voynich Manuscript predates even works such as <a href="https://en.wikipedia.org/wiki/Johannes_Trithemius">Trithemius</a>&#8216;s <a href="https://en.wikipedia.org/wiki/Steganographia">Steganographia</a>, or the <a href="https://en.wikipedia.org/wiki/Book_of_Soyga">Book of Soyga</a> and its <a href="https://link.springer.com/chapter/10.1007/1-4020-4246-9_10">magic tables</a> of letters that so obsessed <a href="https://en.wikipedia.org/wiki/John_Dee">John Dee</a>.</p>
<p>In contrast, however, it has been claimed that other features of the text raise doubts. One of the most common counter-arguments to the natural-language hypothesis of the Voynich text is that some words are repeated an unnatural number of times. Depending on the transcription, individual words have been reported to be repeated up to five times in succession. Whilst this is not an impossible occurrence in human language, it is highly irregular.</p>
<p>The next post in this short series will focus on the Voynich Manuscript&#8217;s adherence, or lack thereof, to Zipf&#8217;s Law in full. Following that, we will see the extent to which other forms of modern textual analysis can be applied to dissect the arcane and unrelenting secrets of MS 408.</p>
<p>This post, however, will describe the contortions required to render the Voynich text suitable for our particular form of scrutiny.</p>
<h1>Assumptions</h1>
<p>Given the format and presentation of the text, we make several assumptions about the writing system contained in the Voynich Manuscript:</p>
<ul>
<li>It is written in an alphabet, or potentially an <a href="https://en.wikipedia.org/wiki/Abjad">abjad</a> or even an <a href="https://en.wikipedia.org/wiki/Abugida">abugida</a><span id='easy-footnote-5-883' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#easy-footnote-bottom-5-883' title='Indeed, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Ge%CA%BDez&quot;&gt;Ge&amp;#8217;ez language&lt;/a&gt;, from which the term &lt;em&gt;abugida&lt;/em&gt; was derived, has at various times been proposed as a candidate for the source language of the Voynich Manuscript'><sup>5</sup></a></span>, and not a <a href="https://en.wikipedia.org/wiki/Logogram">logographic system</a>. That the text is not logographic is justified by the small number of individual symbols. The distinction between the other systems is sufficiently subtle that it will not affect our analyses<span id='easy-footnote-6-883' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#easy-footnote-bottom-6-883' title='In fact, most of the analyses we perform would also function in a logographic system.'><sup>6</sup></a></span>.</li>
<li>The manuscript is written from left to right, and not right to left, vertically, or in <a href="https://en.wikipedia.org/wiki/Boustrophedon">boustrophedon</a>. This is uncontroversial and apparent from even a cursory inspection of the text itself; the horizontal flow of the writing is clear, with lines starting at the left margin and ending before the right. The text is separated into paragraphs, the final line of each being justified to the left.</li>
</ul>
<h1>Data</h1>
<p>Due to the diligent activity of several generations of Voynich researchers, the text of the manuscript has been transcribed into a machine-readable format. As the alphabet is unknown, there are minor uncertainties in rendering the text, leading to a number of similar but competing transcriptions. The subtle details of the various transcription efforts, and their history, are available at: <a href="http://www.voynich.nu/transcr.html">http://www.voynich.nu/transcr.html</a>, with the raw data available at <a href="http://www.voynich.nu/data/">http://www.voynich.nu/data/</a>. We have settled on the <a href="http://www.voynich.nu/transcr.html#v101">v101</a> transliteration by Glen Claston, rendered in the <a href="https://www.voynich.nu/data/IVTFF_format.pdf">Intermediate Voynich Transliteration File Format</a> (IVTFF) of Zandbergen. This is one of the more recent and widely-used transcriptions, and has the added advantage of being supported by a <a href="http://www.voynich.nu/roadmap.html#fonts">TrueType font</a>. The underlying file is available here: <a href="http://www.voynich.nu/data/GC_ivtff_s.txt"> http://www.voynich.nu/data/GC_ivtff_s.txt</a>.</p>
<h1>Crude Manipulations</h1>
<p>We perform the following steps to make the data usable for our analyses. In many scenarios, we would develop a generalisable set of steps to allow conversion of many documents to an appropriate form. Until and unless, however, a new cache of documents in the same language is found, it is simpler to perform these one-time steps manually.</p>
<p>Firstly, we delete from the text all incomplete words, as marked in the IVTFF format. This includes:</p>
<ul>
<li>all text in angle brackets</li>
<li>all words containing ?&#8217;s</li>
<li>all words containing []</li>
</ul>
<p>Secondly, we tokenize the text and remove punctuation. The transcription of the Voynich manuscript that we have chosen uses the following punctuation:</p>
<ul>
<li>&#8220;.&#8221; is a space</li>
<li>&#8220;,&#8221; is a potential space. For simplicity, we do not treat these as spaces.</li>
</ul>
<p>Finally, we organize the document in an appropriate form to be imported into an R data frame, or tidyverse tibble.</p>
<p>The above steps were performed in the Vim text editor, and the commands used are reproduced in the code below:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show Vim text manipulation commands.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
[code]
# Delete all commented lines
:%g/^#.*/d

# Remove blank lines
:%g/^$/d

# Remove "," - assume that potential spaces are /not/ spaces.
:%s/,//g

# Replace each folio's page marker (initial for each page)
# with its contents, followed by a comma. (&lt;f1r&gt; -&gt; f1r,)
:%s/^&lt;fRos&gt;\s*&lt;.\{-}&gt;$/\rfRos,/
:%s/^&lt;\(f[0-9]*[rv][0-9]*\)&gt;\s*&lt;.\{-}&gt;$/\r\1,/

# Remove all &lt;&gt; entries (non-greedy)
:%s/&lt;.\{-}&gt;\s*//g

# Join all paragraphs (all newlines followed by a character
# other than a newline are removed).
:%s/\n\([^\n]\)/.\1/
:%s/^.f/f/
:%s/,./,/

# Replace "high ascii" rare characters from the IVTFF with their
# ASCII representation. (http://www.voynich.nu/img/extra/v101a.jpg)
:%s/@\([0-9]\{-}\);/\=nr2char(submatch(1))/g

# Replace full stops with spaces
:%s/\./ /g
[/code]
</div></div>
</div>
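<p>For those who would rather not replay an interactive Vim session, the same cleaning can be sketched as a short Python function. This is a best-effort re-implementation of the steps described above, not the processing actually used for this post: the handling of IVTFF comment lines, <code>&lt;&#8230;&gt;</code> markup, uncertainty markers, and separators follows the description in the text, and the sample input is invented for illustration.</p>

```python
import re

def clean_ivtff(lines):
    """Reduce IVTFF transliteration lines to (folio, text) records,
    mirroring the manual Vim steps described above."""
    records, folio, words = [], None, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):       # drop comments and blank lines
            continue
        marker = re.match(r"<(f\d+[rv]\d*|fRos)", line)  # page or locus marker
        if marker and marker.group(1) != folio:
            if folio is not None:                  # close off the previous folio
                records.append((folio, " ".join(words)))
                words = []
            folio = marker.group(1)
        body = re.sub(r"<[^>]*>", "", line)        # strip all inline <...> markup
        body = body.replace(",", "")               # potential spaces: not spaces
        for word in re.split(r"[.\s]+", body):     # "." separates words
            if word and "?" not in word and "[" not in word:
                words.append(word)                 # keep only complete words
    if folio is not None:
        records.append((folio, " ".join(words)))
    return records

# A tiny invented sample in the spirit of the IVTFF layout.
sample = [
    "# An IVTFF comment line",
    "",
    "<f1r>  <! page header>",
    "<f1r.1,@P0> fachys.ykal.ar.?aiin.shol",
    "<f1v>  <! page header>",
    "<f1v.1,@P0> daiin.daiin",
]
print(clean_ivtff(sample))
```

<p>On this sample the function drops the comment and the uncertain word <code>?aiin</code>, and yields one (folio, text) record per page, ready to be written out as the comma-separated form used below.</p>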
<p>The resulting raw data file is available <a href="http://www.weirddatascience.net/wp-content/uploads/2019/09/GC_ivtff_s-processed.txt">here</a>. This file can be read into R simply by use of the <code>read_csv</code> function from the tidyverse&#8217;s readr package:</p>
<pre class="brush: r; title: ; notranslate">
voynich_tbl &lt;- 
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%
	rename( folio = X1, text = X2 )
</pre>
<p>As a first, horrifying glance into the forms of analysis that this allows, we can now use our raw data to identify the most repeated words in the manuscript, according to our transcription. The following R code extracts the entirety of the text and encodes it as a <a href="https://en.wikipedia.org/wiki/Run-length_encoding">run length encoding</a>. This conveniently results in a sequential list of words and the number of times that each is repeated <em>in sequence</em>. We can then simply extract the largest number of repetitions for each word in the corpus:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Count longest word repetition sequences in the Voynich Manuscript.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
[code language="r"]
library( tidyverse )
library( magrittr )

# Count the number of repeated words in the Voynich Manuscript text.

# Load the raw data
voynich_tbl &lt;-
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%
	rename( folio = X1, text = X2 )

# Extract the text as a vector of words
voynich_vector &lt;-
	voynich_tbl %&gt;%
	extract2( &quot;text&quot; ) %&gt;%
	paste( sep=&quot; &quot;, collapse=&quot; &quot; ) %&gt;%
	str_split( &quot; &quot; ) %&gt;%
	unlist

# Create a run length encoding object from the vector
voynich_rle &lt;-
	voynich_vector %&gt;%
	rle

# Convert the rle object to a data frame and report the maximum number of
# repeated cases for each word
voynich_repetitions &lt;-
	voynich_rle %&gt;%
	unclass %&gt;%
	as.data.frame %&gt;%
	group_by( values ) %&gt;%
	summarise( max_repetitions = max( lengths ) ) %&gt;%
	ungroup %&gt;%
	arrange( desc( max_repetitions ) )
[/code]
</div></div>
</div>
<p>This simple analysis shows that, in the transcription we have chosen, the longest sequences of repeated words are only three words in length, occurring a total of five times in the text. While there are many other arguments against the potential validity of the Voynich Manuscript, word repetition does not in itself present a compelling reason to doubt that the text is a human language.</p>
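<p>The run-length idea itself is language-agnostic. As an illustrative cross-check of the logic, independent of the R pipeline above, the same computation can be sketched in Python over a toy word sequence (the tokens are invented, not drawn from the transcription):</p>

```python
from itertools import groupby

# A toy sequence standing in for the Voynich word vector.
words = ["daiin", "daiin", "daiin", "ol", "chedy", "chedy", "daiin"]

# Run-length encode: each run of identical consecutive words becomes
# a (word, run_length) pair, just as R's rle() produces.
runs = [(word, len(list(group))) for word, group in groupby(words)]

# For each word, keep its longest consecutive run -- the analogue of
# group_by / summarise( max( lengths ) ) in the R code above.
max_reps = {}
for word, length in runs:
    max_reps[word] = max(max_reps.get(word, 0), length)

print(max_reps)  # {'daiin': 3, 'ol': 1, 'chedy': 2}
```

<p>Note that the encoding counts only <em>consecutive</em> repetitions: the final isolated &#8220;daiin&#8221; starts a fresh run of length one rather than extending the earlier run of three.</p>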
<p>We have now reduced the strange and beautiful elegance of the Voynich Manuscript&#8217;s centuries-old illuminations to a crude, utilitarian abstraction. With this particular act of artistic and literary desecration complete, in the next post we will examine Zipf&#8217;s Law in more detail, and interrogate the extent to which this law supports or undermines the text&#8217;s authenticity.</p>
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">883</post-id>	</item>
	</channel>
</rss>
