<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Weird Data Science</title>
	<atom:link href="https://www.weirddatascience.net/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.weirddatascience.net</link>
	<description>Paranormal Distributions. Cyclopean Data. Esoteric Regression.</description>
	<lastBuildDate>Sat, 15 Nov 2025 14:01:45 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>
<site xmlns="com-wordpress:feed-additions:1">143387998</site>	<item>
		<title>Bayes vs. the Invaders (Redivivus)</title>
		<link>https://www.weirddatascience.net/2025/11/15/bayes-vs-the-invaders-redivivus/</link>
					<comments>https://www.weirddatascience.net/2025/11/15/bayes-vs-the-invaders-redivivus/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Sat, 15 Nov 2025 10:50:27 +0000</pubDate>
				<category><![CDATA[event]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[stan]]></category>
		<category><![CDATA[ufo]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=10231</guid>

					<description><![CDATA[<div class="mh-excerpt">Unidentifiable aerial and marine phenomena. Impossible lights in the sky. Patterns of visitation and terror. Insidious influences from the hadal voids between the stars. Who--what--swoop and glide through the ink-black nights of our world, probing and testing our structures, our societies, our minds? From barely remembered history, to early reports of impossible objects, to blurrily evidenced documentation, data concerning flying arcane observations have grown and twisted, along with our capacity to lay them bare, to subject them to analysis, and to interrogate their secrets.

In this year's OII Halloween Lecture, we will tremblingly revisit a Bayesian analysis of seventy years of UFO sightings, drawn from a dataset collected by the National UFO Reporting Center (NUFORC). Scepticism, fear, doubt, and most accepted standards of statistical rigour, will be cast aside in our unyielding and disquieting pursuit of an uncompromised truth.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2025/11/15/bayes-vs-the-invaders-redivivus/" title="Bayes vs. the Invaders (Redivivus)">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>Straying still further from whatever dubious graces it once presumed, academia&#8217;s luciferous descent into murky realms of huddled speculation continues unabated. Relocated, reconstituted, in ever-fading cycles, the Oxford Internet Institute at the University of Oxford once again saw fit to deprive its students of the peace and comfort of banal rationality through this fourth annual Halloween Lecture.</p>
<p>In the absence of fresh insight, this year&#8217;s lecture revisits the chilling implications of humanity&#8217;s contact with inexplicable aerial and marine phenomena, drawing from the faintest whispers of the ante-historical record through to the shimmering echoes of statistical reasoning. Through what means do visitors from beyond the void encroach on our night-time skies? What subtle deceptions underpin their visible geometries? What values lie behind their peculiar interests in certain uncomfortably-favoured regions?</p>
<p>Despite all safeguards, and in the face of numerous barely-perceptible currents opposing such efforts, this event was captured, stored, and released into a world still cruelly unprepared to face its findings.</p>
<p>More details of these findings, and the underlying code, can be found&#8211;for those unwary enough to look&#8211;in the series of entries beginning <a href="https://www.weirddatascience.net/2019/04/03/bayes-vs-the-invaders-part-one-the-37th-parallel/">here</a>.</p>
<p><a href="https://www.youtube.com/watch?v=WobPKY4UZc0"><img decoding="async" src="https://img.youtube.com/vi/WobPKY4UZc0/2.jpg" alt="Bayes vs. the Invaders (Redivivus)"></a></p>
<p><a href="https://www.youtube.com/watch?v=WobPKY4UZc0">Click here to view the video on YouTube</a>.</p>

<blockquote><p>
<strong>Oxford Internet Institute Halloween Lecture</strong><br />
Bayes vs. the Invaders (Redivivus): A Bayesian Analysis of 70 Years of UFO Sightings<br />
<em>Prof. Joss Wright</em><br />
<em>Oxford. October 2025</em></p>
<p>The searing heat of summer retreats, cools, fades, surrendering its vitality to the flickering uncertainties of autumn. Nights draw close, like dimly-remembered friends clustering in our dreams. The spring leaves abandon their verdant dance, as they age, wither, and drift into a russet swirl of skeletal, wind-stirred fragments. </p>
<p>The dying seasons return, dragging with them time-hallowed fears and uneasy rumours, pooling around these darkly dreaming spires in a mire of primal superstition. The agonizingly brittle certainties of the modern enlightenment, our desperate faith in the gossamer fabrics of scientific progress, falter in the face of primal terrors that lurk implacably in the gloom.</p>
<p>In these darkening days, as faith in our treasured understanding dims, it is yet again time to turn our faces fully to the darkness. Halloween, slouching inexorably towards our minds, impels us as scholars to gather our methods, our theories, our data, our knowledge; and glean what light we can from the primordial glimmers of the unknown.</p>
<p>Unidentifiable aerial and marine phenomena. Impossible lights in the sky. Patterns of visitation and terror. Insidious influences from the hadal voids between the stars. Who&#8211;what&#8211;swoop and glide through the ink-black nights of our world, probing and testing our structures, our societies, our minds? From barely remembered history, to early reports of impossible objects, to blurrily evidenced documentation, data concerning flying arcane observations have grown and twisted, along with our capacity to lay them bare, to subject them to analysis, and to interrogate their secrets.</p>
<p>In this year&#8217;s OII Halloween Lecture, we will tremblingly revisit a Bayesian analysis of seventy years of UFO sightings, drawn from a dataset collected by the National UFO Reporting Center (NUFORC). Scepticism, fear, doubt, and most accepted standards of statistical rigour, will be cast aside in our unyielding and disquieting pursuit of an uncompromised truth.
</p></blockquote>
<a href="https://www.weirddatascience.net/wp-content/uploads/2025/11/2025-bayes_vs_invaders-redivivus.pdf" class="pdfemb-viewer" style="" data-width="max" data-height="max" data-mobile-width="500"  data-scrollbar="none" data-download="on" data-tracking="on" data-newwindow="on" data-pagetextbox="off" data-scrolltotop="off" data-startzoom="100" data-startfpzoom="100" data-toolbar="bottom" data-toolbar-fixed="off">2025-bayes_vs_invaders-redivivus<br/></a>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2025/11/15/bayes-vs-the-invaders-redivivus/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">10231</post-id>	</item>
		<item>
		<title>Numbers of the Beast: Sasquatch Distribution Modelling</title>
		<link>https://www.weirddatascience.net/2025/07/21/numbers-of-the-beast-sasquatch-distribution-modelling/</link>
					<comments>https://www.weirddatascience.net/2025/07/21/numbers-of-the-beast-sasquatch-distribution-modelling/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Mon, 21 Jul 2025 18:09:01 +0000</pubDate>
				<category><![CDATA[cryptozoology]]></category>
		<category><![CDATA[event]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[spatial analysis]]></category>
		<category><![CDATA[stan]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=9482</guid>

					<description><![CDATA[<div class="mh-excerpt">The third OII Halloween Lecture sinks bodily into the tortured mass of data concerning cryptozoological sightings in North America. Drawing on over a century of shadow-haunted sightings documenting the curiously repellent presence of the North American Sasquatch, or Bigfoot, we aim to identify the factors associated with its presence, delineate the confounding presence of other dread manifestations, and cast our minds globally for a faded glimpse of its remote and scarcely-conceived brethren.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2025/07/21/numbers-of-the-beast-sasquatch-distribution-modelling/" title="Numbers of the Beast: Sasquatch Distribution Modelling">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>Dredged from the stygian depths of last year&#8217;s waning: a cruelly belated artefact of academia&#8217;s primordial descent. For a third time, the Oxford Internet Institute at the University of Oxford—scarcely willing, but falteringly unable to reflect on its own ill-fated decisions—chose to risk the sanity and statistical intuition of its students against the numerically unstable nightmares writhing beneath the tattered fabric of our waking world.</p>
<p>This third OII Halloween Lecture sinks bodily into the tortured mass of data concerning cryptozoological sightings in North America. Drawing on over a century of shadow-haunted sightings documenting the curiously repellent presence of the North American Sasquatch, or Bigfoot, we aim to identify the factors associated with its presence, delineate the confounding presence of other dread manifestations, and cast our minds globally for a faded glimpse of its remote and scarcely-conceived brethren.</p>
<p>In what can only generously be considered an act of gross negligence, this lecture was stored, at great risk to both its author and its audience, digitally. Technology has, once again, brought us closer than we might care to dread to secrets that we were never intended to taste.</p>
<p>A sharper, more visceral presentation of the baroque thinking underpinning these materials will follow, in an ever-spiralling series of posts on this site. Until then, however, incautious travellers may solemnly consider the below.</p>
<p><a href="https://www.youtube.com/watch?v=M_upySOOzcI"><img decoding="async" src="https://img.youtube.com/vi/M_upySOOzcI/2.jpg" alt="Numbers of the Beast: Sasquatch Distribution Modelling"></a></p>
<p><a href="https://www.youtube.com/watch?v=M_upySOOzcI">Click here to view the video on YouTube</a>.</p>

<blockquote><p>
<strong>OII Halloween Lecture</strong><br />
Sasquatch Distribution Modelling: Investigating patterns of Bigfoot sightings in North America.<br />
<em>Prof. Joss Wright</em><br />
<em>Oxford. October 2024</em></p>
<p>The nights grow ever closer. Streetlights flicker, shrouded in the mists, striving to pierce the gloom. The warm certainties of summer give way to the cold, dark ambiguities of autumn. Rumour, myth, and legend rise, primeval, from the shadowed recesses of our collective consciousness, undermining our faith in the fragile congruities that structure our lives.</p>
<p>As summer surrenders to the inexorable tread of autumn, our resolve falters against the unknown horrors residing in the tenebrous peripheries of the world. As scientists, as scholars, our duty is to cling resolutely to our methods and our ideals in the face of encroaching darkness. Our tools may seem fragile in the face of the seething irrationality of the night, but we are called to peer, however tremulously, wherever our inquiries may lead us.</p>
<p>Lights streak across the sky. Stories are woven of twisted faces in the darkness, half-glimpsed creatures in ancient forests, strange encounters in the wilderness. From the earliest stirrings of humanity, to the patterns of complex arcana that silently control our lives today, folklore and legend have long reported phenomena that rebel against mundane description or understanding. As our technologies evolve, and our ability to collate, scrutinise, and manipulate data spiral beyond all restraint, we are ever more capable, if not indeed obliged, to bring the lens of science to bear on these harrowing mysteries.</p>
<p>To embrace this dark season, you are invited to the annual Oxford Internet Institute Halloween Lecture.</p>
<p>This year we will pursue one of the world’s most notorious cryptozoological phenomena, investigating over a century of data regarding sightings of the Sasquatch, or Bigfoot, of North America. In which regions are these cryptic precursors of humanity most commonly observed? What factors, whether environmental or physical, create habitats most suitable for the Sasquatch to thrive, hidden from the encroaching pressures of humanity? Where, were we bold enough to look, might we seek other populations of these elusive creatures?</p>
<p>In this lecture we will examine some of the history surrounding sightings of Bigfoot, and related cryptids. We will impetuously apply statistical methods to derive underlying patterns from reported sightings, and heedlessly strive to uncover their meaning and implications. What can we learn from the accumulated data about the habits of cryptic species living on the fringes of our world? Is the beast, as ever, closer to us than we wish to believe?</p></blockquote>
<a href="https://www.weirddatascience.net/wp-content/uploads/2025/07/sasquatch_distribution_modelling.pdf" class="pdfemb-viewer" style="" data-width="max" data-height="max" data-mobile-width="500"  data-scrollbar="none" data-download="on" data-tracking="on" data-newwindow="on" data-pagetextbox="off" data-scrolltotop="off" data-startzoom="100" data-startfpzoom="100" data-toolbar="bottom" data-toolbar-fixed="off">sasquatch_distribution_modelling<br/></a>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2025/07/21/numbers-of-the-beast-sasquatch-distribution-modelling/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">9482</post-id>	</item>
		<item>
		<title>Readings from the Book</title>
		<link>https://www.weirddatascience.net/2024/01/28/readings-from-the-book/</link>
					<comments>https://www.weirddatascience.net/2024/01/28/readings-from-the-book/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Sun, 28 Jan 2024 16:27:02 +0000</pubDate>
				<category><![CDATA[beyond the veil]]></category>
		<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[event]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[stan]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=5768</guid>

					<description><![CDATA[<div class="mh-excerpt">Once again, the Oxford Internet Institute at the University of Oxford -- through madness, or through omission brought on by horrified incredulity -- saw fit to expose its students to the nightmarish patterns that descend, fractal-like, endlessly below the surface of mundane reality. This second OII Halloween Lecture drew on the twisted meanderings we travellers have taken through the cryptic verbiage of the Voynich Manuscript.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2024/01/28/readings-from-the-book/" title="Readings from the Book">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>Once again, the Oxford Internet Institute at the University of Oxford &#8212; through madness, or through omission brought on by horrified incredulity &#8212; saw fit to expose its students to the nightmarish patterns that descend, fractal-like, endlessly below the surface of mundane reality.</p>
<p>This second OII Halloween Lecture drew on the twisted meanderings we travellers have taken through the cryptic verbiage of the Voynich Manuscript. We aimed to establish the dread authenticity of the text by rousing its very statistical bones from the inscrutable fasciae of its pages. Walking a tightrope between careful statistical exploration and ever-burgeoning insanity, we further explored the structures that arise from the text, separating the untranslated knowledge in the book into coherent bodies for future study.</p>
<p>In yet another, almost criminally negligent, oversight, the OII&#8217;s 2023 Halloween Lecture was captured, frozen in space and time, for the detriment and despair of the unexpectant world.</p>
<p><a href="https://www.youtube.com/watch?v=nl7QRWIRcSk"><img decoding="async" src="https://img.youtube.com/vi/nl7QRWIRcSk/2.jpg" alt="Readings from the Book"></a></p>
<p><a href="https://www.youtube.com/watch?v=nl7QRWIRcSk">Click here to view the video on YouTube</a>.</p>

<p>For those not driven to blissful negation by the tortured ramblings of the above, the underlying materials for the talk are presented, with neither hope nor tremor, here.</p>
<a href="https://www.weirddatascience.net/wp-content/uploads/2024/01/illuminating_the_illuminated.pdf" class="pdfemb-viewer" style="" data-width="max" data-height="max" data-mobile-width="500"  data-scrollbar="none" data-download="on" data-tracking="on" data-newwindow="on" data-pagetextbox="off" data-scrolltotop="off" data-startzoom="100" data-startfpzoom="100" data-toolbar="bottom" data-toolbar-fixed="off">illuminating_the_illuminated<br/></a>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2024/01/28/readings-from-the-book/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5768</post-id>	</item>
		<item>
		<title>Whisperings in the Academy</title>
		<link>https://www.weirddatascience.net/2022/11/20/whisperings-in-the-academy/</link>
					<comments>https://www.weirddatascience.net/2022/11/20/whisperings-in-the-academy/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Sun, 20 Nov 2022 13:12:43 +0000</pubDate>
				<category><![CDATA[event]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[spatial analysis]]></category>
		<category><![CDATA[stan]]></category>
		<category><![CDATA[ufo]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=4385</guid>

					<description><![CDATA[<div class="mh-excerpt">The noblest of human endeavours is to enlighten the uninitiated consciousness; to bare its awareness before the endless and terrifying vistas that lie beyond darkness and ignorance.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2022/11/20/whisperings-in-the-academy/" title="Whisperings in the Academy">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>The noblest of human endeavours is to enlighten the uninitiated consciousness; to bare its awareness before the endless and terrifying vistas that lie beyond darkness and ignorance.</p>
<p>In pursuit of such necessarily painful revelations, the Oxford Internet Institute at the University of Oxford &#8212; the unwitting host on which the investigations here parasitise &#8212; recently hosted an inaugural Halloween lecture. This oration drew on several years of dark explorations chronicled in this blog, to inculcate into a new generation of unprepared and curious minds the horror and necessity of subjecting our reality to the insidious power of statistical science. Through what seems a dangerously careless oversight, this brief glimpse of truth was recorded and made available for posterity.</p>
<p><a href="https://www.youtube.com/watch?v=qaBYjnXbnWE"><img decoding="async" src="https://img.youtube.com/vi/qaBYjnXbnWE/2.jpg" alt="Whisperings in the Academy"></a></p>
<p><a href="https://www.youtube.com/watch?v=qaBYjnXbnWE">Click here to view the video on YouTube</a>.</p>

<p>For the terminally inquisitive, the archival materials from which this work was drawn are presented here.</p>
<a href="https://www.weirddatascience.net/wp-content/uploads/2022/11/bayes_vs_invaders.pdf" class="pdfemb-viewer" style="" data-width="max" data-height="max" data-mobile-width="500"  data-scrollbar="none" data-download="on" data-tracking="on" data-newwindow="on" data-pagetextbox="off" data-scrolltotop="off" data-startzoom="100" data-startfpzoom="100" data-toolbar="bottom" data-toolbar-fixed="off">bayes_vs_invaders<br/></a>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2022/11/20/whisperings-in-the-academy/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4385</post-id>	</item>
		<item>
		<title>Illuminating the Illuminated – Part Four: Tempora Mutantur &#124; Changepoint Analysis of the Voynich Manuscript</title>
		<link>https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/</link>
					<comments>https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/#comments</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Fri, 21 Feb 2020 10:31:38 +0000</pubDate>
				<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[cryptology]]></category>
		<category><![CDATA[stan]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=2009</guid>

					<description><![CDATA[<div class="mh-excerpt">Our past interrogation of the Voynich Manuscript has deconstructed its esoteric symbols into a form more suitable for our ends, subjected its statistical properties to comparison with more mundane texts, and unearthed its hidden internal structures via the esoteric process of topic modelling. In this final post, we <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/" title="Illuminating the Illuminated – Part Four: Tempora Mutantur &#124; Changepoint Analysis of the Voynich Manuscript">[...]</a></div>]]></description>
										<content:encoded><![CDATA[<p>Our past interrogation of the Voynich Manuscript has <a href="https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/">deconstructed its esoteric symbols into a form more suitable for our ends</a>, <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">subjected its statistical properties to comparison with more mundane texts</a>, and <a href="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/">unearthed its hidden internal structures via the esoteric process of topic modelling</a>. In this final post, we will build on the structures revealed in earlier posts to ask how, if at all, the Voynich Manuscript&#8217;s textual properties <em>shift</em> within the text itself. Are there significant discontinuities in the writing, indicating a separation of the manuscript into meaningful sections? Or is the text merely a homogenous mass more suggestive of a rote, mechanical, generative procedure?</p>
<p>To address this question we will delve once more into the arcana of machine learning, and draw out the technique of <em>changepoint analysis</em>. This procedure aims to identify one or more points in a series of observations at which the underlying process that generates the data has somehow altered.</p>
<p>Once more, we will operate within the warm embrace of Bayesian statistics and exploit the <a href="https://mc-stan.org/">Stan modelling language</a> as our means to cast light into the darkness<span id='easy-footnote-1-2009' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/#easy-footnote-bottom-1-2009' title='As of this post, future work will be accompanied with public code and data available at &lt;a href=&quot;https://github.com/weirddatascience/weirddatascience&quot;&gt;https://github.com/weirddatascience/weirddatascience&lt;/a&gt;. As numerous of our routines, particularly for plot generation, are repeated, future code will also rely on our custom R package: &lt;a href=&quot;https://github.com/weirddatascience/grimoire&quot;&gt;&lt;code&gt;grimoire&lt;/code&gt;&lt;/a&gt;. We will also be dredging through our dark history to make the code for previous posts public in the near future.'><sup>1</sup></a></span>.</p>
<h2>On Shifting Sands</h2>
<p>Changepoint analysis is an active field of endeavour, with deep subtleties in its application. For the purposes of this analysis we will focus on the comparatively simple problem of identifying a single changepoint amongst the teeming mass of the Voynich Manuscript&#8217;s strangely compelling glyphs.</p>
<p>Statistical analysis of changepoints has been applied to a number of historical texts. To provide a first, tantalising glimpse into a world characterised by authorship analysis and the dark arts of <a href="https://www1.icsi.berkeley.edu/~sadia/papers/adversarial_stylometry.pdf">adversarial stylometry</a>, we begin by reproducing the work of <a href="https://www.tandfonline.com/doi/abs/10.1080/0266476052000330295">Riba and Ginebra</a> in ascertaining the existence of a shift in authorship in the 15th century Catalan chivalric romance <a href="http://www.gutenberg.org/ebooks/378">Tirant lo Blanc</a>.</p>
<p>Briefly, Tirant lo Blanc was written by <a href="https://en.wikipedia.org/wiki/Joanot_Martorell">Joanot Martorell</a>, a Valencian knight, whose untimely death left the manuscript unfinished. The work was completed and published by <a href="https://en.wikipedia.org/wiki/Mart%C3%AD_Joan_de_Galba">Martí Joan de Galba</a>. The specific nature of his contributions has, however, been the subject of some debate: did he substantially compose parts of the text, or simply arrange and edit the work?</p>
<p>Riba and Ginebra proposed, in 2005, to identify any stylistic change in the work through a Bayesian analysis of the frequency of word lengths in the document. Their approach, built on a tradition of such analyses in the stylometry literature, relies on the fact that differing authors, even when attempting to mimic the style of another, unconsciously make different word choices. Most importantly, the relative frequency of shorter <em>context-free</em> words is likely to differ between authors.</p>
<p>From a certain analytical perspective, each word in a sequence of text can be represented merely by its length, as in the analyses of our <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">earlier posts</a>. Taking this view to a probabilistic extreme, a text can therefore be considered as a sequence of draws from a <a href="https://en.wikipedia.org/wiki/Categorical_distribution"><em>categorical distribution</em></a> of word lengths. We may consider the length of each word as resulting from rolls of some abstract, biased die with a number of sides equal to the number of possible word lengths in the prose. The full text, therefore, can itself be treated as a draw from a <a href="https://en.wikipedia.org/wiki/Multinomial_distribution"><em>multinomial distribution</em></a> in which the number of categories matches the possible lengths of words observed in the document, and the number of trials is the number of words in the text. The stylistic differences between authors are known to be revealed most strongly in the relative frequency of the shorter words.</p>
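<p>As a minimal sketch of this representation (illustrative only: the passage is invented, and the full pipeline below uses <code>tidytext</code> against the actual texts), a fragment of prose can be reduced to the word-length counts on which the multinomial model feeds:</p>
[code language="r"]
# Reduce a passage to counts of short word lengths: the raw material of
# the multinomial model. The passage here is invented for illustration.
passage <- "the pale glyphs of the manuscript resist all mundane reading"
words   <- strsplit( tolower( passage ), "\\s+" )[[1]]
lengths <- nchar( words )

# Counts of words of lengths 1 to 4; longer words are discarded, as in
# the analysis of short context-free words.
counts <- table( factor( lengths[ lengths < 5 ], levels = 1:4 ) )
counts
[/code]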
<p>Having descended thus far, and with dark suspicions of an authorship change disrupting the contiguity of Martorell&#8217;s opus, we hypothesise that the entire volume may best be described not by one, but by two multinomial distributions over short word lengths: one for the earlier text of the original author, with a second describing the later contributions of his posthumous collaborator. The point of division between these distributions is the changepoint.</p>
<h2>Winds of Change</h2>
<p>To drag this concept from its abstract formulation to a tangible realisation, we turn to the Stan probabilistic programming language. The models here draw heavily on the <a href="https://mc-stan.org/docs/2_22/stan-users-guide/index.html">Stan user&#8217;s guide</a> <a href="https://mc-stan.org/docs/2_22/stan-users-guide/change-point-section.html">changepoint section</a>, which lays out the concepts underlying these approaches with horrifying clarity.</p>
<p>Perhaps the most unusual aspect of implementing this model stems from Stan&#8217;s inability to sample <em>discrete</em> parameters; in this case, the location of the changepoint. As such, the model must conceal a <em>latent</em> discrete parameter within its construction. This may then be marginalised out to reveal the probability of each value of the latent parameter. The model we construct here, therefore, will provide us not simply with a <em>point estimate</em> of the most likely changepoint, but a set of probabilities for each possible changepoint.</p>

<p>As cryptically hinted above, our model of the Voynich text abstracts its prose to counts of word lengths found on each folio of the manuscript, resulting in a multinomial distribution of word lengths. For the changepoint, we hypothesise that there are not one, but two multinomials with different parameters falling on either side of some changepoint. Our goal is to identify the pivotal folio at which this underlying shift most likely occurs.</p>
<p>Representing this more formally, in the style used for generative Bayesian models, we can write the distribution of word lengths, &#92;(\Omega&#92;), in terms of its hyperparameters, with &#92;(T&#92;) the number of text units (chapters or folios) in the sequence, as:</p>
<p>$$\begin{eqnarray}
\Omega &amp;\sim&amp; \mathbf{Multinomial}( t &lt; c~?~\theta_e : \theta_l )&#92;&#92;
\theta_e &amp;\sim&amp; \mathbf{Dirichlet}(\alpha)&#92;&#92;
\theta_l &amp;\sim&amp; \mathbf{Dirichlet}(\alpha)&#92;&#92;
c &amp;\sim&amp; \mathbf{Uniform}(1, T)
\end{eqnarray}$$</p>
<p>The most crucial element for isolating the changepoint is the <a href="https://mc-stan.org/docs/2_18/reference-manual/conditional-operator-section.html">conditional operator </a> in the first line. We treat the frequency of words of different lengths on a given folio as one point in a sequence, indexed by the value &#92;(t&#92;). The conditional statement encodes that observed word lengths prior to some unknown point &#92;(t = c&#92;) are drawn from a multinomial with one vector of parameters, &#92;(\theta_e&#92;); from that folio onwards, word lengths are drawn according to a second multinomial with parameter vector &#92;(\theta_l&#92;). When appropriately constructed, fitting the model produces a posterior distribution across the various possible &#92;(\theta_e&#92;) and &#92;(\theta_l&#92;) parameters for all possible changepoints. The folio at which the posterior probability of &#92;(c&#92;) is highest is our best estimate of the changepoint, and is accompanied by estimates of &#92;(\theta_e&#92;) and &#92;(\theta_l&#92;) around that changepoint.</p>
<p>As mentioned above, we <em>marginalise</em> the discrete changepoint parameter, &#92;(c&#92;), rather than sampling it directly. A full example of the concept, applied to a Poisson distribution, is given by the Stan Users&#8217; Guide <a href="https://mc-stan.org/docs/2_22/stan-users-guide/change-point-section.html">changepoint section</a>. This key step removes the parameter &#92;(c&#92;) from the full joint probability function of the model, leaving a likelihood of word lengths in terms of the parameters &#92;(\theta_e&#92;) and &#92;(\theta_l&#92;), which can be calculated in Stan by summing over this likelihood for all possible values of &#92;(c&#92;)<span id='easy-footnote-2-2009' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/#easy-footnote-bottom-2-2009' title='The marginalisation here is given as a sum, rather than a product, due to Stan operating on the log scale for reasons of computational efficiency and numerical stability.'><sup>2</sup></a></span>.</p>
<p>Reproducing, with only slight adaptations, the original Poisson example, our full joint probability would be:</p>
<p>$$p(\theta_e, \theta_l, c, \Omega) =<br />
p(\theta_e)p(\theta_l)p(c)p(\Omega|\theta_e,\theta_l,c)$$</p>
<p>Marginalising out &#92;(c&#92;), this can be represented as:</p>
<p>$$\begin{eqnarray}
p(\Omega|\theta_e, \theta_l) &amp;=&amp; \sum_{c=1}^T p(c,\Omega|\theta_e,\theta_l)&#92;&#92;
&amp;=&amp; \sum_{c=1}^T p(c)p(\Omega|c,\theta_e,\theta_l)
\end{eqnarray}$$</p>
<p>The result is that our Stan model can be constructed by sampling across values of &#92;(\theta_e&#92;) and &#92;(\theta_l&#92;) for all possible values of &#92;(c&#92;). Due to the requirement to sum across all possible values of the discrete parameter, however, this subterfuge of marginalisation is restricted in general to <em>bounded</em> discrete parameters.</p>
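<p>Before distilling this into Stan, the marginalisation can be glimpsed in miniature. The sketch below, a toy in R that assumes a Poisson series with known rates purely for illustration (after the Stan user&#8217;s guide example), computes the posterior over a single discrete changepoint in the same manner as the model that follows: accumulating log-likelihoods under each regime for every candidate changepoint, then normalising on the log scale.</p>
[code language="r"]
# Toy marginalisation of a discrete changepoint, mirroring the Poisson
# example in the Stan user's guide. The rates lambda_e and lambda_l are
# fixed and known here purely for illustration; Stan samples its
# parameters instead.
set.seed( 1 )
y <- c( rpois( 30, lambda=2 ), rpois( 20, lambda=6 ) )
n_obs <- length( y )
lambda_e <- 2
lambda_l <- 6

# For each candidate changepoint cp: log p(cp) + log p(y | cp), with a
# uniform prior p(cp) = 1/n_obs. Observations before cp follow the early
# regime; observations from cp onwards follow the late regime.
log_p <- sapply( 1:n_obs, function( cp )
	-log( n_obs ) +
	sum( dpois( y[ seq_len( cp - 1 ) ], lambda_e, log=TRUE ) ) +
	sum( dpois( y[ cp:n_obs ], lambda_l, log=TRUE ) ) )

# Normalise on the log scale (a softmax), as in the generated quantities
# block of the Stan model below.
changepoint_prob <- exp( log_p - max( log_p ) )
changepoint_prob <- changepoint_prob / sum( changepoint_prob )
which.max( changepoint_prob )	# most probable changepoint location
[/code]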
<p>We can distill the above into a Stan model as given below.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Multinomial changepoint model</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>multinomial_changepoint.stan</code><br />
[code language="c"]
data {

	int<lower=0> num_obs;			// Number of observations (rows/pages) in data.
	int<lower=0> num_cats;			// Number of categories in data.
	int y[num_obs, num_cats];		// Matrix of observations.

	vector<lower=0>[num_cats] alpha;	// Dirichlet prior values.

}

transformed data {

	// Uniform prior across all time points for changepoint.
	real log_unif;
	log_unif = -log(num_obs);

}

parameters {

	// Two sets of parameters.
	// One (early) before changepoint, one (late) for after.
	simplex[num_cats] theta_e;
	simplex[num_cats] theta_l;

}

transformed parameters {

	// This commented-out code shows a slower, but easier to understand,
	// update of the log posterior via summation.
	//
	// vector[num_obs] lp;
	// lp = rep_vector(log_unif, num_obs);
	// for (s in 1:num_obs)
	//   for (t in 1:num_obs)
	//     lp[s] = lp[s] + multinomial_lpmf(y[t,] | t < s ? theta_e : theta_l);

	// The approach below instead relies on dynamic programming to reduce
	// runtime from quadratic to linear in num_obs.
	// See <https://mc-stan.org/docs/2_19/stan-users-guide/change-point-section.html>
	vector[num_obs] log_p;
	{
		vector[num_obs + 1] log_p_e;
		vector[num_obs + 1] log_p_l;

		log_p_e[1] = 0;
		log_p_l[1] = 0;

		// Cumulative log-likelihoods of the first i observations under
		// theta_e, and under theta_l.
		for( i in 1:num_obs ) {
			log_p_e[i + 1] = log_p_e[i] + multinomial_lpmf( y[i,] | theta_e );
			log_p_l[i + 1] = log_p_l[i] + multinomial_lpmf( y[i,] | theta_l );
		}

		// log p(c) + log p(y | c, theta_e, theta_l) for each candidate changepoint c.
		log_p =
			rep_vector( log_unif + log_p_l[num_obs + 1], num_obs ) +
			head( log_p_e, num_obs ) - head( log_p_l, num_obs );
	}
}

model {

	// Priors
	theta_e ~ dirichlet( alpha );
	theta_l ~ dirichlet( alpha );

	// Marginalise out the discrete changepoint location.
	target += log_sum_exp( log_p );

}

generated quantities {

	simplex[num_obs] changepoint_simplex;	// Simplex of locations for changepoint.

	// Convert the log posterior to a simplex.
	changepoint_simplex = softmax( log_p );

}
[/code]
</div></div>
</div>
<p>To launch the model in gentler waters, we apply it to Martorell and de Galba&#8217;s <em>Tirant lo Blanc</em> to discover where, if at all, de Galba&#8217;s major contributions to the text begin, and to see the extent to which our Stan rendering reproduces the results of Riba and Ginebra&#8217;s analysis. As in their work, we will focus only on those words shorter than 5 letters, where the key <em>stylistic</em> difference between authors makes itself apparent.</p>
<p>We place a relatively uninformative <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet prior</a> on the multinomial &#92;(\theta_e&#92;) and &#92;(\theta_l&#92;) parameters, with the vector of &#92;(\alpha&#92;) values set to one. This results in a uniform distribution across all possible simplexes. Note that this does <em>not</em> push the multinomial towards a uniform simplex, but instead makes all possible simplexes equally likely. Values of &#92;(\alpha &gt; 1&#92;) produce increasingly uniform simplexes; values of &#92;(0 &lt; \alpha &lt; 1&#92;) produce simplexes in which the probability mass is more likely to be concentrated in some given element.</p>
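<p>The behaviour of these &#92;(\alpha&#92;) values can be checked directly. The brief sketch below uses the standard construction of Dirichlet draws as normalised gamma variates, with a small helper, <code>rdirichlet_one</code>, defined here for the purpose; it shows the prior at the values used here and in the Voynich analysis below.</p>
[code language="r"]
# Sample a single draw from a Dirichlet distribution via normalised
# gamma variates: g_i ~ Gamma(alpha_i, 1), then g / sum(g).
rdirichlet_one <- function( alpha ) {
	g <- rgamma( length( alpha ), shape=alpha, rate=1 )
	g / sum( g )
}

set.seed( 1 )
rdirichlet_one( rep( 1, 4 ) )	# alpha = 1: all simplexes equally likely
rdirichlet_one( rep( 0.6, 4 ) )	# alpha < 1: mass tends to concentrate
rdirichlet_one( rep( 10, 4 ) )	# large alpha: draws cluster near uniform
[/code]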
<p>With the data and model thus conjoined, we can examine the results of this analytic process.</p>
<figure id="attachment_2035" aria-describedby="caption-attachment-2035" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot.png"><img fetchpriority="high" decoding="async" data-attachment-id="2035" data-permalink="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/multinomial_changepoint_tirant_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multinomial_changepoint_tirant_plot.png" data-image-description="&lt;p&gt;Tirant lo Blanc Changepoint&lt;/p&gt;
" data-image-caption="&lt;p&gt;Tirant lo Blanc Changepoint&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot.png" alt="Tirant lo Blanc Changepoint" width="1920" height="1080" class="size-full wp-image-2035" srcset="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot-678x381.png 678w" sizes="(max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-2035" class="wp-caption-text">Tirant lo Blanc Changepoint | (<a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_tirant_plot.pdf">PDF Version</a>)</figcaption></figure>
<p>Our multinomial automaton suggests a location for a significant stylistic changepoint in <em>Tirant lo Blanc</em> with the main concentration of probability mass around chapter 374. Perhaps unsurprisingly, but pleasingly, this is in close agreement with the earlier analysis of Riba and Ginebra, who placed their estimates between chapters 371 and 382.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span><em>Tirant lo Blanc</em> analysis code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
[code language="r"]
library( tidyverse )
library( tidyselect )
library( magrittr )

library( rstan )

library( tidytext )

# Load Tirant data
message( "Reading raw Tirant data..." )
tirant_tbl <-
	read_csv( "data/tirant_raw.csv", col_names=FALSE ) %>%
	rename( chapter = X1, text = X2 ) %>%
	mutate( page = as.numeric( rownames(.) ) )

# Tokenize
tirant_words <-
	tirant_tbl %>%
	unnest_tokens( word, text )

# Pivot the data wider to be presented to Stan as a matrix of multinomial
# samples, retaining only words of lengths 1 to 4.
tirant_lengths <-
	tirant_words %>%
	mutate( word_length = str_length( word ) ) %>%
	mutate( word_length = ifelse( word_length > 9, 10, word_length )) %>%
	group_by( page, word_length ) %>%
	summarise( count = n() ) %>%
	pivot_wider( names_from = word_length, values_from = count ) %>%
	ungroup %>%
	select( -c( page, "5", "6", "7", "8", "9", "10" ) ) %>%
	select( sort( peek_vars() ) ) %>%
	replace( is.na(.), 0 )

if( not( file.exists( "work/multinomial_changepoint_tirant_fit.rds" ) ) ) {

	message( "Fitting multinomial model." )
	tirant_multinom_fit <-
		stan( "multinomial_changepoint.stan",
				data=list(
					num_obs=487,
					num_cats=4,
					y = as.matrix( tirant_lengths ),
					alpha = rep( 1, 4 ) ),
				iter=16000,
				control=list(
					adapt_delta=0.98,
					max_treedepth=15 ) )

	saveRDS( tirant_multinom_fit, "work/multinomial_changepoint_tirant_fit.rds" )

} else {
	message( "Loading saved multinomial model." )
	tirant_multinom_fit <- readRDS( "work/multinomial_changepoint_tirant_fit.rds" )
}

# Extract the mean posterior changepoint probabilities ('changepoint_simplex').
mean_changepoint_prob <-
	extract( tirant_multinom_fit )$changepoint_simplex %>%
	as_tibble( .name_repair="unique" ) %>%
	summarise_all( mean ) %>%
	pivot_longer( everything() ) %>%
	rowid_to_column()

# Save values for plotting
saveRDS( mean_changepoint_prob, file="work/mean_changepoint_prob_tirant.rds" )
[/code]
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span><em>Tirant lo Blanc</em> data file</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/tirant_raw.csv"><em>Tirant lo Blanc</em> formatted raw text data.</a><br />
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span><em>Tirant lo Blanc</em> changepoint plot code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>multinomial_changepoint_tirant_plot.r</code><br />
[code language="r"]
library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( cowplot )  # save_plot()
library( grimoire ) # <https://github.com/weirddatascience/grimoire>

# Fonts
font_add( "main_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )

showtext_auto()

mean_changepoint_prob <-
	readRDS( "work/mean_changepoint_prob_tirant.rds" )

changepoint_plot <-
	ggplot( mean_changepoint_prob ) +
	geom_col( aes( x=rowid, y=value ), fill=weird_colours["blood"] ) +
	labs( x="Chapter", y="Probability of Changepoint" ) +
	theme(
		panel.background = element_rect( fill = "transparent", colour = "transparent" ),
		plot.background = element_rect( fill = "transparent", colour = "transparent" ),
		plot.title = element_text( family="bold_font", colour=weird_colours["ink"], size=22 ),
		plot.subtitle = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
		axis.text = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
		axis.title.x = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
		axis.title.y = element_text( family="bold_font", angle=90, colour=weird_colours["ink"], size=12 ),
		axis.line = element_line( colour=weird_colours["ink"] ),
		panel.grid.major.x = element_blank(),
		panel.grid.major.y = element_blank(),
		panel.grid.minor.x = element_blank(),
		panel.grid.minor.y = element_blank()
	)

# grimoire::decorate_plot() from <https://github.com/weirddatascience/grimoire>
parchment_plot <-
	decorate_plot(
		title="Tirant lo Blanc Chapter Changepoint Probability",
		subtitle="http://www.weirddatascience.net | @WeirdDataSci",
		plot=changepoint_plot,
		bg_image="resources/img/parchment.jpg",
		footer="Data: http://einesdellengua.com/tirantloweb/tirantloblanch.html" )

save_plot( "output/multinomial_changepoint_tirant_plot.pdf",
		parchment_plot,
		base_width = 16,
		base_height = 9,
		base_aspect_ratio = 1.78 )
[/code]
</div></div>
</div>
<h2>Nos et mutamur in illis</h2>
<p>The changepoint model realises the dark imaginings of our predecessors. What horrors might it reveal when applied to the Voynich Manuscript? Following the logic of the above analysis, we will focus initially on the shorter words in the Voynich corpus.</p>
<p>To improve model fit we will place a little more information in the prior, setting the &#92;(\alpha&#92;) values for the Dirichlet prior to 0.6 to push the multinomial towards more concentrated probabilities. This reflects the gamma distribution of word length frequencies discussed in our <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">earlier analysis</a>. We also combine all single- and two-letter Voynich terms into a single category due to the small number of words falling into these categories<span id='easy-footnote-3-2009' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/#easy-footnote-bottom-3-2009' title='This second piece of wisdom, in particular, was provided, along with a host of other points of guidance, via the &lt;a href=&quot;https://discourse.mc-stan.org/&quot;&gt;Stan Forums&lt;/a&gt;. Particular thanks are due to forum members @bbales and @emiruz.'><sup>3</sup></a></span>.</p>
<p>The model can now reveal where, if at all, a likely fracture resides in the textual assemblage of the Voynich Manuscript.</p>
<figure id="attachment_2052" aria-describedby="caption-attachment-2052" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot.png"><img decoding="async" data-attachment-id="2052" data-permalink="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/multinomial_changepoint_voynich_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multinomial_changepoint_voynich_plot" data-image-description="&lt;p&gt;Voynich Manuscript changepoint plot&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript changepoint plot&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot.png" alt="Voynich Manuscript Changepoint" width="1920" height="1080" class="size-full wp-image-2052" srcset="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot-678x381.png 678w" sizes="(max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-2052" class="wp-caption-text">Voynich Manuscript Changepoint | (<a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_plot.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript word length analysis code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>multinomial_changepoint_voynich.r</code><br />
[code language="r"]
library( tidyverse )
library( tidyselect )
library( magrittr )

library( rstan )

library( tidytext )

# Load Voynich data
message( "Reading raw Voynich data..." )
voynich_tbl <-
	read_csv( "data/voynich_raw.csv", col_names=FALSE ) %>%
	rename( folio = X1, text = X2 )

# Tokenize
voynich_words <-
	voynich_tbl %>%
	unnest_tokens( word, text )

# Pivot the data wider to be presented to Stan as a matrix of multinomial
# samples. Single-letter words are merged into the two-letter category,
# and lengths of five and above are dropped.
voynich_lengths <-
	voynich_words %>%
	mutate( word_length = str_length( word ) ) %>%
	mutate( word_length = ifelse( word_length > 8, 9, word_length )) %>%
	mutate( word_length = ifelse( word_length < 2, 2, word_length )) %>%
	group_by( folio, word_length ) %>%
	summarise( count = n() ) %>%
	pivot_wider( names_from = word_length, values_from = count ) %>%
	ungroup %>%
	select( -c( "folio", "5", "6", "7", "8", "9" ) ) %>%
	select( sort( peek_vars() ) ) %>%
	replace( is.na(.), 0 )

if( not( file.exists( "work/multinomial_changepoint_voynich_fit.rds" ) ) ) {

	message( "Fitting multinomial model." )
	voynich_seed <- 1912
	num_cats <- ncol( voynich_lengths )
	voynich_multinomial_fit <-
		stan( "multinomial_changepoint.stan",
				data=list(
					num_obs=226,
					num_cats=num_cats,
					y = as.matrix( voynich_lengths ),
					alpha = rep( 0.6, num_cats ) ),
				chains=4,
				iter=8000, seed=voynich_seed,
				control=list(
					adapt_delta=0.99,
					max_treedepth=12 ) )

	saveRDS( voynich_multinomial_fit, "work/multinomial_changepoint_voynich_fit.rds" )

} else {
	message( "Loading saved multinomial model." )
	voynich_multinomial_fit <- readRDS( "work/multinomial_changepoint_voynich_fit.rds" )
}

# Extract the mean posterior changepoint probabilities ('changepoint_simplex').
mean_changepoint_prob <-
	extract( voynich_multinomial_fit )$changepoint_simplex %>%
	as_tibble( .name_repair="unique" ) %>%
	summarise_all( mean ) %>%
	pivot_longer( everything() ) %>%
	rowid_to_column()

# Save mean values for plotting
saveRDS( mean_changepoint_prob, file="work/mean_changepoint_prob_voynich.rds" )
[/code]
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript data file</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_raw.csv">Voynich Manuscript formatted raw text data.</a></p>
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript changepoint plot code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>multinomial_changepoint_voynich_plot.r</code><br />
[code language="r"]
library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( cowplot )
library( grimoire ) # <https://github.com/weirddatascience/grimoire>

# Fonts
font_add( "voynich_font", "resources/fonts/voynich/eva1.ttf" )
font_add( "main_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )

showtext_auto()

mean_changepoint_prob <-
	readRDS( "work/mean_changepoint_prob_voynich.rds" )

changepoint_plot <-
	ggplot( mean_changepoint_prob ) +
	geom_col( aes( x=rowid, y=value ), fill=weird_colours["blood"] ) +
	labs( x="Folio", y="Probability of Changepoint" ) +
	theme(
		panel.background = element_rect( fill = "transparent", colour = "transparent" ),
		plot.background = element_rect( fill = "transparent", colour = "transparent" ),
		plot.title = element_text( family="bold_font", colour=weird_colours["ink"], size=22 ),
		plot.subtitle = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
		axis.text = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
		axis.title.x = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
		axis.title.y = element_text( family="bold_font", colour=weird_colours["ink"], angle=90, size=12 ),
		axis.line = element_line( colour=weird_colours["ink"] ),
		panel.grid.major.x = element_blank(),
		panel.grid.major.y = element_blank(),
		panel.grid.minor.x = element_blank(),
		panel.grid.minor.y = element_blank()
	)

# grimoire::decorate_plot() from <https://github.com/weirddatascience/grimoire>
parchment_plot <-
	decorate_plot(
		title="Voynich Folio Word-Length Changepoint",
		subtitle="http://www.weirddatascience.net | @WeirdDataSci",
		plot=changepoint_plot,
		bg_image="resources/img/parchment.jpg",
		footer="Data: http://www.voynich.nu" )

save_plot( "output/multinomial_changepoint_voynich_plot.pdf",
		parchment_plot,
		base_width = 16,
		base_height = 9,
		base_aspect_ratio = 1.78 )
[/code]
</div></div>
</div>
<p>Our model appears to identify a potential changepoint in the frequency of short words in the Voynich Manuscript somewhere around Folio 33.</p>
<p>In contrast to the analysis of <em>Tirant lo Blanc</em>, we have no <em>a priori</em> suspicion that multiple authors were involved in the creation of the Voynich Manuscript. As such, we may hypothesise this changepoint as reflecting a simple stylistic shift, a shift in content, or, as with the previous analysis, a shift in authorship.</p>
<h2>Mutatis Mutandis</h2>
<p>Recalling the topic model from the <a href="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/">previous post</a>, and the manually-assigned topics hinted at by the diagrams in the manuscript, Folio 33 might not immediately arouse our suspicions as the most obvious candidate for such a shift in style, falling as it does some way through the manually-identified herbal section, and without any readily apparent shift in the distribution of topics.</p>
<p>For simplicity of presentation and analysis we will work with the alternative 12-topic model suggested by the metrics in the previous post rather than the 34-topic model initially given there<span id='easy-footnote-4-2009' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/#easy-footnote-bottom-4-2009' title='We have reproduced the analyses in this post for the 34-topic model with similar results.'><sup>4</sup></a></span>. The distribution of topics in this model can be presented as those of the 34-topic model were in our previous post.</p>
<figure id="attachment_2073" aria-describedby="caption-attachment-2073" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12.png"><img decoding="async" data-attachment-id="2073" data-permalink="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/voynich_folio_topic_heatmap-12-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_folio_topic_heatmap-12" data-image-description="&lt;p&gt;Voynich folio-topic heatmap&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich folio-topic heatmap&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12.png" alt="Voynich Manuscript Folio Topic Heatmap" width="1920" height="1080" class="size-full wp-image-2073" srcset="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12-678x381.png 678w" sizes="(max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-2073" class="wp-caption-text">Voynich Manuscript Folio Topic Heatmap | (<a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_topic_heatmap-12.pdf">PDF Version</a>)</figcaption></figure>
<p>We might, then, ask whether the distribution of topics from this model itself has a changepoint. Having interrogated the distribution of word length frequencies, we can pose the same questions to the distribution of assignments produced by the topic model. Similarly to above: were we to conceive of the topic model assignment for each page as being the result of the roll of some biased die, is there a notable point in the document where the bias of that die seems to shift?</p>
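<p>To render this die-rolling intuition concrete before loosing it upon the Manuscript, the brief sketch below simulates the scenario that our Stan model is asked to unpick. Every value in it is invented purely for illustration: the number of folios, the category probabilities, and the location of the shift bear no relation to the Voynich data.</p>
[code language=&#8221;r&#8221;]
<p># Illustrative simulation only: all values here are hypothetical.<br />
library( tidyverse )</p>
<p>set.seed( 1912 )<br />
theta_before &lt;- c( 0.4, 0.3, 0.2, 0.1 )	# Die bias before the changepoint.<br />
theta_after &lt;- c( 0.2, 0.2, 0.3, 0.3 )	# Die bias after the changepoint.</p>
<p># One multinomial draw of 50 words per &quot;folio&quot;, shifting at folio 80.<br />
counts &lt;-<br />
	map( 1:226,<br />
		  ~ rmultinom( 1, size=50,<br />
							prob=if( .x &lt; 80 ) theta_before else theta_after ) ) %&gt;%<br />
	do.call( cbind, . ) %&gt;%<br />
	t()</p>
<p># Each row holds one folio&#8217;s category counts; the changepoint model&#8217;s<br />
# task is to recover the row at which the generating probabilities shifted.<br />
head( counts )</p>
[/code]
<p>Fed to the same Stan model as above, such data should concentrate the posterior changepoint mass tightly around folio 80.</p>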
<p>Framed as such, there is little more required to apply this model to the topic assignments. Our Stan model, dredged from its slumber, merely needs to be provided with the topic model folio assignment data.</p>
<figure id="attachment_2061" aria-describedby="caption-attachment-2061" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot.png"><img loading="lazy" decoding="async" data-attachment-id="2061" data-permalink="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/multinomial_changepoint_voynich_topic-12_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multinomial_changepoint_voynich_topic-12_plot" data-image-description="&lt;p&gt;Voynich Manuscript topic model changepoint&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript topic model changepoint&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot.png" alt="Voynich Manuscript topic model changepoint" width="1920" height="1080" class="size-full wp-image-2061" srcset="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-2061" class="wp-caption-text">Voynich Manuscript topic model changepoint | (<a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/multinomial_changepoint_voynich_topic-12_plot.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript topic model changepoint code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>multinomial_changepoint_voynich_topics.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( tidyselect )<br />
library( magrittr )</p>
<p>library( rstan )<br />
library( tidytext )</p>
<p>library( gtools )		# For `mixedsort()`, to sort column names numerically.</p>
<p># Number of topics in model<br />
num_topics &lt;- 12</p>
<p># Load Voynich topic model data<br />
message( &quot;Reading Voynich topic model data&#8230;&quot; )<br />
voynich_tbl &lt;-<br />
	readRDS( paste0( &quot;work/topic_identity-&quot;, num_topics, &quot;.rds&quot; )) %&gt;%<br />
	select( -c( &quot;gamma&quot;, &quot;section&quot; ) ) %&gt;%<br />
	ungroup</p>
<p># Pivot the data wider to be presented to Stan as a matrix of samples from a multinomial.<br />
voynich_lengths &lt;-<br />
	voynich_tbl %&gt;%<br />
	mutate( count=1 ) %&gt;%<br />
	pivot_wider( names_from = topic, values_from = &quot;count&quot; ) %&gt;%<br />
	select( -document ) %&gt;%<br />
	select(mixedsort(peek_vars())) %&gt;%<br />
	replace( is.na(.), 0 )</p>
<p>topic_fit_file &lt;- paste0( &quot;work/multinomial_changepoint_voynich_topic_fit-&quot;, num_topics, &quot;.rds&quot; )<br />
if( not( file.exists( topic_fit_file ) ) ) {</p>
<p>	message( &quot;Fitting multinomial model.&quot;)<br />
	voynich_multinom_fit &lt;-<br />
		stan( &quot;multinomial_changepoint.stan&quot;,<br />
			  data=list(<br />
							num_obs=226,<br />
							num_cats=num_topics,<br />
							y = as.matrix( voynich_lengths ),<br />
							alpha = rep( 1, num_topics ) ),<br />
			  iter=8000,<br />
			  seed=19300319,<br />
			  control=list( adapt_delta=0.9 ) )<br />
	saveRDS( voynich_multinom_fit, topic_fit_file )</p>
<p>} else {<br />
	message( &quot;Loading saved multinomial model.&quot;)<br />
	voynich_multinom_fit &lt;- readRDS( topic_fit_file )<br />
}</p>
<p># Extract the calculated changepoint probabilities (&#8216;changepoint_simplex&#8217;) and compute their means.<br />
mean_changepoint_prob &lt;-<br />
	extract( voynich_multinom_fit )$changepoint_simplex %&gt;%<br />
	as_tibble( .name_repair=&quot;unique&quot; ) %&gt;%<br />
	summarise_all( mean ) %&gt;%<br />
	pivot_longer( everything() ) %&gt;%<br />
	rowid_to_column() </p>
<p># Save values for plotting<br />
saveRDS( mean_changepoint_prob, file=paste0( &quot;work/mean_changepoint_prob_voynich_topic-&quot;, num_topics, &quot;.rds&quot; ) )</p>
[/code]
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript topic model changepoint plot code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>multinomial_changepoint_voynich_topics_plot.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( cowplot )<br />
library( grimoire ) # https://github.com/weirddatascience/grimoire</p>
<p># Fonts<br />
font_add( &quot;voynich_font&quot;, &quot;resources/fonts/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Specify the number of topics in the model<br />
num_topics &lt;- 12</p>
<p>mean_changepoint_prob &lt;-<br />
	readRDS( paste0(&quot;work/mean_changepoint_prob_voynich_topic-&quot;, num_topics, &quot;.rds&quot; ) )</p>
<p>changepoint_plot &lt;-<br />
	ggplot( mean_changepoint_prob ) +<br />
	geom_col( aes( x=rowid, y=value ), fill=weird_colours[&quot;blood&quot;] ) +<br />
	labs( x=&quot;Folio&quot;, y=&quot;Probability of Changepoint&quot; ) +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.title = element_text( family=&quot;bold_font&quot;, colour=weird_colours[&quot;ink&quot;], size=22 ),<br />
			plot.subtitle = element_text( family=&quot;bold_font&quot;, colour=weird_colours[&quot;ink&quot;], size=12 ),<br />
			axis.text = element_text( family=&quot;bold_font&quot;, colour=weird_colours[&quot;ink&quot;], size=12 ),<br />
			axis.title.x = element_text( family=&quot;bold_font&quot;, colour=weird_colours[&quot;ink&quot;], size=12 ),<br />
			axis.title.y = element_text( family=&quot;bold_font&quot;, colour=weird_colours[&quot;ink&quot;], angle=90, size=12 ),<br />
			axis.line = element_line( colour=weird_colours[&quot;ink&quot;] ),<br />
			panel.grid.major.x = element_blank(),<br />
			panel.grid.major.y = element_blank(),<br />
			panel.grid.minor.x = element_blank(),<br />
			panel.grid.minor.y = element_blank()<br />
			) </p>
<p># grimoire::decorate_plot() from &lt;https://github.com/weirddatascience/grimoire&gt;<br />
parchment_plot &lt;-<br />
	decorate_plot(<br />
					  title=paste0( &quot;Voynich Folio Changepoint Probability &#8211; Topic Model (&quot;, num_topics, &quot; topics)&quot;),<br />
					  subtitle=&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;,<br />
					  plot=changepoint_plot,<br />
					  bg_image=&quot;resources/img/parchment.jpg&quot;,<br />
					  footer=&quot;Data: http://www.voynich.nu&quot;,<br />
						rel_heights=c(0.1, 1, 0.05 ))</p>
<p>save_plot( paste0(&quot;output/multinomial_changepoint_voynich_topic-&quot;, num_topics, &quot;_plot.pdf&quot; ),<br />
							parchment_plot,<br />
							base_width = 16,<br />
							base_height = 9,<br />
			           	base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<p>The location suggested for the topic model changepoint is surprisingly close to that identified for the frequency counts of short words in the Voynich Manuscript. Whilst the changepoint probabilities are somewhat more diffuse in the topic model analysis, the greatest probability mass is centred around Folio 38, with much lower spikes extending out as far as Folio 30 and Folio 55.</p>
<p>In addition to the mutual support that these two analyses provide, it is notable that the major changepoint identified by both falls directly in the earlier portion of the first, major &#8220;herbal&#8221; section identified manually by Voynich scholars through inspection of the images accompanying the text. This suggests that the first section, at least in terms of textual content, is not as homogeneous as has previously been suggested. Future scholars investigating the structure of the Voynich Manuscript may therefore wish to direct more attention towards the early middle of the herbal section, around Folios 30 to 40, to identify what dreadful changes may emerge at that point in the text.</p>
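<p>To quantify, rather than merely eyeball, how diffuse this posterior mass is, we can rank the folios by their mean changepoint probability and accumulate until some chosen fraction of the mass is covered. A minimal sketch, assuming the <code>mean_changepoint_prob</code> tibble saved by the script above, with its <code>rowid</code> and <code>value</code> columns:</p>
[code language=&#8221;r&#8221;]
<p>library( tidyverse )</p>
<p>mean_changepoint_prob &lt;-<br />
	readRDS( &quot;work/mean_changepoint_prob_voynich_topic-12.rds&quot; )</p>
<p># The smallest set of folios covering 90% of the posterior changepoint<br />
# mass: a crude credible set for the changepoint location.<br />
changepoint_credible &lt;-<br />
	mean_changepoint_prob %&gt;%<br />
	arrange( desc( value ) ) %&gt;%<br />
	mutate( cumulative=cumsum( value ) ) %&gt;%<br />
	filter( lag( cumulative, default=0 ) &lt; 0.9 )</p>
<p>changepoint_credible</p>
[/code]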
<p>We might naturally, but do not here, extend this analysis by shattering further the smooth unity of the Manuscript according to <em>multiple</em> changepoints. Such an extension is, as intimated in the <a href="https://mc-stan.org/docs/2_21/stan-users-guide/change-point-section.html">guide</a>, conceptually simple but computationally burdensome, as it requires recalculation of multiple potential distributions across an ever-increasing number of parameters. As such, we leave this analysis, and the means to conduct it more efficiently, for the dim future.</p>
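<p>Schematically, and in our own notation rather than the guide&#8217;s: the single-changepoint model marginalises the discrete changepoint out of the likelihood as a single sum over folios, whilst each additional changepoint multiplies the number of segment configurations to be evaluated.</p>
[code language=&#8221;latex&#8221;]
<p>% Single changepoint: T terms, one per candidate position.<br />
p(y) = \sum_{\tau=1}^{T} p(\tau) \, p(y \mid \tau)</p>
<p>% Two changepoints: a sum over all ordered pairs; in general, K changepoints<br />
% demand \binom{T}{K} configurations, hence the computational burden.<br />
p(y) = \sum_{\tau_1 &lt; \tau_2} p(\tau_1, \tau_2) \, p(y \mid \tau_1, \tau_2)</p>
[/code]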
<h2>Omnia mutantur, nihil interit</h2>
<p>Our analysis has provided us with an abstracted location in the text at which our unsettling suspicions of change lie. The two folios arousing greatest curiosity are therefore Folio 33 and Folio 38, which we present here to reify our horror. It is worth highlighting, however, that the changepoint analysis says nothing specific about these two folios; the model identifies that the inexplicable scrawling prior to the changepoint differs significantly from the maddeningly incomprehensible glyphs following it, nothing more.</p>
<figure id="attachment_2066" aria-describedby="caption-attachment-2066" style="width: 2560px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-scaled.jpg"><img loading="lazy" decoding="async" data-attachment-id="2066" data-permalink="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/voynich_folio_33/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-scaled.jpg" data-orig-size="2560,1732" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_folio_33" data-image-description="&lt;p&gt;Voynich Folio 33&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Folio 33&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-300x203.jpg" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-1024x693.jpg" src="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-scaled.jpg" alt="Voynich Folio 33" width="2560" height="1732" class="size-full wp-image-2066" srcset="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-scaled.jpg 2560w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-300x203.jpg 300w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-1024x693.jpg 1024w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-768x520.jpg 768w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-1536x1039.jpg 1536w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_33-2048x1386.jpg 2048w" sizes="auto, (max-width: 2560px) 100vw, 2560px" /></a><figcaption id="caption-attachment-2066" class="wp-caption-text">Voynich Folio 33</figcaption></figure>
<figure id="attachment_2067" aria-describedby="caption-attachment-2067" style="width: 2560px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-scaled.jpg"><img loading="lazy" decoding="async" data-attachment-id="2067" data-permalink="https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/voynich_folio_38/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-scaled.jpg" data-orig-size="2560,1715" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_folio_38" data-image-description="&lt;p&gt;Voynich Folio 38&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Folio 38&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-300x201.jpg" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-1024x686.jpg" src="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-scaled.jpg" alt="Voynich Folio 38" width="2560" height="1715" class="size-full wp-image-2067" srcset="https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-scaled.jpg 2560w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-300x201.jpg 300w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-1024x686.jpg 1024w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-768x515.jpg 768w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-1536x1029.jpg 1536w, https://www.weirddatascience.net/wp-content/uploads/2020/02/voynich_folio_38-2048x1372.jpg 2048w" sizes="auto, (max-width: 2560px) 100vw, 2560px" /></a><figcaption id="caption-attachment-2067" class="wp-caption-text">Voynich Folio 38</figcaption></figure>
<p>The statistical properties that we have uncovered in the Voynich Manuscript over the past four posts reveal something of its inner structure. They support, but cannot prove, the conjectures that the Manuscript is not a hoax, and that its text is most likely drawn from some natural language.</p>
<p>The changepoint analyses in this post are a powerful tool for identifying evolution and mutation in data, and the demonstrated application of stylometric analysis to Martorell&#8217;s <em>Tirant lo Blanc</em> supports their use in revealing points of fracture in texts, without reference to the source language.</p>
<p>With specific relevance to the Voynich Manuscript, both the word frequency changepoint and the topic model changepoint suggest that the manuscript&#8217;s contents shift significantly at some point around Folios 30 to 40. Given the previous assignment of topics based on manual identification of images accompanying the text, this presents a new avenue of investigation for Voynich researchers.</p>
<p>We have resisted the tantalising draw of attempting to translate the Voynich Manuscript. The tools we have applied are more broadly statistical, and aim at unveiling structures and revealing patterns in the text; whilst they may provide information towards deciphering the text, that particular conundrum is for the future.</p>
<p>There are, as we would always wish, many avenues left unexplored in this particular labyrinth. The topic model is crude, and more subtle disassemblies could well provide a more refined view. The word frequency patterns support natural language, but we have not made any effort to correlate them with known languages. We have treated words as a unit of analysis, but have not looked in detail at the structure of likely prefixes and suffixes; similar words with differing endings are particularly notable in the topic model, and analysis of these could reveal much more than we have dared to attempt.</p>
<p>There could be much to learn from assessing multiple changepoints in the Manuscript. The presentation of the volume certainly supports its composition of multiple disparate sections; perhaps identifying an inexorable sequence of stylistic shifts could unveil still more of this structure.</p>
<p>For now, however, our meandering journey into the dim twilight of the Voynich Manuscript has drawn to a close, leaving us still searching for illumination in the shadows of this most <a href="https://www.nsa.gov/Portals/70/documents/about/cryptologic-heritage/historical-figures-publications/publications/misc/voynich_manuscript.pdf">elegant enigma</a>.</p>
<p>Continue to search, in fear of what you may find.</p>
<hr />
<p>Code and data for this post: <a href="https://github.com/weirddatascience/weirddatascience/tree/master/20200220-voynich04-tempora_mutantur">https://github.com/weirddatascience/weirddatascience/tree/master/20200220-voynich04-tempora_mutantur</a>.</p>
<hr />
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2020/02/21/illuminating-the-illuminated-part-four-tempora-mutantur-changepoint-analysis-of-the-voynich-manuscript/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2009</post-id>	</item>
		<item>
		<title>Illuminating the Illuminated – Part Three: Topics of Invention &#124; Topic Modelling the Voynich Manuscript</title>
		<link>https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/</link>
					<comments>https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Tue, 24 Dec 2019 16:51:08 +0000</pubDate>
				<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[cryptology]]></category>
		<category><![CDATA[linguistics]]></category>
		<guid isPermaLink="false">https://www.weirddatascience.net/?p=1200</guid>

					<description><![CDATA[<div class="mh-excerpt">Our <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">earlier experiments</a> derived some of the darker statistics of the Voynich Manuscript supporting the conjecture, but not erasing all doubt, that the manuscript's cryptic graphemes are drawn from some natural, or shudderingly unnatural, language. Despite our beliefs regarding its authenticity, however, the statistical tools we have employed so far can tell us little about the structure, and almost nothing of the meaning, of the Voynich Manuscript. </div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/" title="Illuminating the Illuminated – Part Three: Topics of Invention &#124; Topic Modelling the Voynich Manuscript">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>Our <a href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/">earlier experiments</a> derived some of the darker statistics of the Voynich Manuscript supporting the conjecture, but not erasing all doubt, that the manuscript&#8217;s cryptic graphemes are drawn from some natural, or shudderingly unnatural, language.</p>
<p>Despite our beliefs regarding its authenticity, however, the statistical tools we have employed so far can tell us little about the structure, and almost nothing of the meaning, of the Voynich Manuscript. In this post, whilst shying away from the madness and confusion of attempting to translate <a href="https://brbl-dl.library.yale.edu/vufind/Record/3519597">MS&nbsp;408</a>, or of definitively identifying its language, we will delve into the extent to which modern natural language processing techniques can reveal its lesser secrets.</p>
<p>The mechanisms we will apply in this post are drawn from the world of <em>topic modelling</em>, an approach widely used in the processing of human language to identify eerily related documents within a corpus of text.</p>
<p>Topic modelling, in its most widely used form, lies in considering each given document as a nebulous admixture of unseen and unknowable <em>topics</em>. These topics, in effect, are themselves probability distributions of words that are likely to occur together. Each document, therefore, is characterised as a set of probability distributions that generate the observed words. This approach, known as <a href="http://jmlr.csail.mit.edu/papers/v3/blei03a.html">Latent Dirichlet Allocation</a>, dispassionately extracts the hidden structure of documents by deriving these underlying distributions.</p>
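<p>To make this generative incantation concrete, the following minimal sketch performs it in miniature. Everything here is invented for illustration: the toy vocabulary, the two topics, and the Dirichlet parameters belong to no model in this post, and <code>rdirichlet()</code> is assumed from the <code>gtools</code> package.</p>
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( gtools )	# For rdirichlet(); an assumption of this sketch.</p>
<p>set.seed( 1921 )<br />
vocabulary &lt;- c( &quot;witch&quot;, &quot;sabbath&quot;, &quot;mark&quot;, &quot;court&quot;, &quot;plant&quot;, &quot;root&quot; )<br />
num_topics &lt;- 2</p>
<p># Topics are probability distributions over words&#8230;<br />
beta &lt;- rdirichlet( num_topics, rep( 0.5, length( vocabulary ) ) )</p>
<p># &#8230;and a document is a probability distribution over topics.<br />
theta &lt;- rdirichlet( 1, rep( 1, num_topics ) )[1,]</p>
<p># Each observed word arises by drawing a topic, then a word from that topic.<br />
document &lt;-<br />
	map_chr( 1:20, function( i ) {<br />
		topic &lt;- sample( num_topics, 1, prob=theta )<br />
		sample( vocabulary, 1, prob=beta[topic,] )<br />
	} )</p>
<p>document</p>
[/code]
<p>Latent Dirichlet allocation runs this ritual in reverse, inferring the hidden topic and document distributions from the observed words alone.</p>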
<p>For known languages, latent Dirichlet allocation extrudes a set of topics characterised by the high-probability words that they generate. These, in turn, can be subjected to human interpretation to identify the semantic underpinnings behind the topics.</p>
<p>To illustrate, we present a topic model of Margaret A. Murray&#8217;s seminal 1921 work <a href="https://en.wikipedia.org/wiki/The_Witch-Cult_in_Western_Europe">&#8220;The Witch Cult in Western Europe&#8221;</a>. There are many uneasy subtleties in producing such a model, into which we will not plunge at this early stage; at a quick glance, however, we can see that from Murray&#8217;s detailed research and interweaved arguments for a modern-day survival of an ancient witch cult in Europe, the algorithm can identify certain prevalent themes. The third topic, for example, appears to conjure terms related to the conflict between the accepted state religion and the &#8216;heathen&#8217; witch cult. The ninth topic concerns itself with the <a href="https://en.wikipedia.org/wiki/Witches%27_mark">witches&#8217; marks</a>, supposedly identified on the body of practitioners; while the tenth dwells on the clandestine meetings and <a href="https://en.wikipedia.org/wiki/Witches%27_Sabbath">sabbaths</a> of the cult.</p>
<figure id="attachment_1206" aria-describedby="caption-attachment-1206" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png"><img loading="lazy" decoding="async" data-attachment-id="1206" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/wcwe_topic_plot_13-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Witch Cult in Western Europe Topic Plot" data-image-description="&lt;p&gt;Topic plot for Murray&amp;#8217;s &amp;#8220;The Witch Cult in Western Europe&amp;#8221;&lt;/p&gt;
" data-image-caption="&lt;p&gt;Topic plot for Murray&amp;#8217;s &amp;#8220;The Witch Cult in Western Europe&amp;#8221;&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png" alt="Topic plot for Murray&#039;s &quot;The Witch Cult in Western Europe&quot;" width="1920" height="1080" class="size-full wp-image-1206" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1206" class="wp-caption-text">Topic plot for Murray&#8217;s &#8220;The Witch Cult in Western Europe&#8221; | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/wcwe_topic_plot_13.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Witch Cult Topic Model Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>wcwe_topics.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )<br />
library( widyr )</p>
<p>library( stm )<br />
library( quanteda )</p>
<p>library(cowplot)</p>
<p># For reorder_within() for facets: &lt;https://juliasilge.com/blog/reorder-within/&gt;<br />
library( drlib ) 		</p>
<p># Fonts<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Read (processed) text of Murray&#8217;s &quot;The Witch Cult in Western Europe&quot;.<br />
wcwe_raw &lt;-<br />
	read_csv( &quot;data/wcwe/wcwe_raw.csv&quot;, col_names=FALSE ) %&gt;%<br />
	rename( text = X1 ) %&gt;%<br />
	rowid_to_column( var = &quot;chapter&quot; )</p>
<p># Tokenize<br />
# (Remove words of 3 letters or less)<br />
# Stemming and stopword removal apparently not so effective anyway,<br />
# according to Schofield et al.: &lt;www.cs.cornell.edu/~xanda/winlp2017.pdf&gt;<br />
wcwe_words &lt;-<br />
	wcwe_raw %&gt;%<br />
	unnest_tokens( word, text ) %&gt;%<br />
	filter( !word %in% stop_words$word )  %&gt;%<br />
	filter( str_length( word ) &gt; 3 )</p>
<p>wcwe_word_counts &lt;-<br />
	wcwe_words %&gt;%<br />
	count( word, chapter, sort = TRUE ) </p>
<p># Generate the corpus<br />
wcwe_dfm &lt;-<br />
	wcwe_words %&gt;%<br />
	count( chapter, word, sort=TRUE ) %&gt;%<br />
	cast_dfm( chapter, word, n )</p>
<p># Search for a number of topics and output goodness-of-fit measures. </p>
<p># N=2 is the number of documents &#8216;held out&#8217; for the goodness-of-fit measure.<br />
# (The model is trained on the main body, then used to calculate the<br />
# likelihood of the held-out documents.) N=2 is used here to produce<br />
# approximately 10% of the corpus.</p>
<p>if( not( file.exists( &quot;work/wcwe_topic_search_k.rds&quot; ) ) ) {</p>
<p>	message( &quot;Searching low-n topic models&#8230;&quot; )</p>
<p>	wcwe_k &lt;-<br />
		searchK( wcwe_dfm, K=c(3:30), N=2 )</p>
<p>	saveRDS( wcwe_k, &quot;work/wcwe_topic_search_k.rds&quot; )</p>
<p>} else {</p>
<p>	wcwe_k &lt;-<br />
		readRDS( &quot;work/wcwe_topic_search_k.rds&quot; )</p>
<p>}</p>
<p># Plot semantic coherence against exclusivity for model selection<br />
wcwe_k_plot &lt;-<br />
	wcwe_k$results %&gt;%<br />
	gather( key=&quot;variable&quot;, value=&quot;value&quot;, exclus, semcoh )</p>
<p>wcwe_k_semcoh_exclusive &lt;-<br />
	ggplot( wcwe_k_plot, aes( x=K, y=value, group=variable) ) +<br />
	geom_line() +<br />
	facet_wrap( ~variable, ncol=1, scales=&quot;free_y&quot; )</p>
<p># Based on metrics of the above, calculate a 13-topic model<br />
if( not( file.exists( &quot;work/wcwe_topic_stm-13.rds&quot; ) ) ) {</p>
<p>	message( &quot;Calculating 13-topic model&#8230;&quot; )</p>
<p>	wcwe_topic_model_13 &lt;-<br />
		stm( wcwe_dfm, K=13, init.type=&quot;Spectral&quot; )</p>
<p>	# This takes a long time, so save output<br />
	saveRDS( wcwe_topic_model_13, &quot;work/wcwe_topic_stm-13.rds&quot; )</p>
<p>} else {</p>
<p>	wcwe_topic_model_13 &lt;- readRDS( &quot;work/wcwe_topic_stm-13.rds&quot; )</p>
<p>}</p>
<p># Work with the 13-topic model for now<br />
wcwe_topic_model &lt;- wcwe_topic_model_13</p>
<p>### Convert output to a tidy tibble<br />
wcwe_topic_model_tbl &lt;-<br />
	tidy(wcwe_topic_model, matrix = &quot;beta&quot; )</p>
<p>wcwe_topics_top &lt;-<br />
	wcwe_topic_model_tbl %&gt;%<br />
	group_by(topic) %&gt;%<br />
	top_n(10, beta) %&gt;%<br />
	ungroup() %&gt;%<br />
	arrange(topic, -beta)</p>
<p>gp &lt;-<br />
	wcwe_topics_top %&gt;%<br />
	mutate(term = reorder_within(term, beta, topic)) %&gt;%<br />
	ggplot(aes(term, beta, fill = factor(topic))) +<br />
	geom_col(show.legend = FALSE, alpha=0.8 ) +<br />
	facet_wrap(~ topic, scales = &quot;free&quot;) +<br />
	scale_x_reordered() +<br />
	coord_flip()</p>
<p># Palette of ink colours obtained from screenshots of Diamine inks.<br />
ink_colours &lt;- c( &quot;#753733&quot;, &quot;#b6091d&quot;, &quot;#e45025&quot;, &quot;#232d1d&quot;,<br />
					  	&quot;#224255&quot;, &quot;#533f50&quot;, &quot;#453437&quot;, &quot;#7f2430&quot;,<br />
						&quot;#254673&quot;, &quot;#52120e&quot;, &quot;#3d2535&quot;, &quot;#25464b&quot;,<br />
						&quot;#2f2a1c&quot; )</p>
<p>gp &lt;-<br />
	gp + scale_fill_manual( values=ink_colours )</p>
<p>topic_plot_13 &lt;-<br />
	gp + labs( x=&quot;Term&quot;, y=&quot;Probability in Topic&quot; ) +<br />
	theme (<br />
			 plot.title = element_text( family=&quot;bold_font&quot;, size=16 ),<br />
			 plot.subtitle = element_text( family=&quot;bold_font&quot;, size=12 ),<br />
			 axis.text.y = element_text( family=&quot;bold_font&quot;, size=10 )<br />
			 ) </p>
<p>theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p>wcwe_topic_plot &lt;-<br />
	topic_plot_13 +<br />
	theme (<br />
			 axis.title.y = element_text( angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
			 ) </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;\&quot;The Witch Cult in Western Europe\&quot; (Murray, 1921) Topic Model&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: Murray, M. \&quot;The Witch Cult in Western Europe\&quot; (1921) | http://www.gutenberg.org/ebooks/20411&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, wcwe_topic_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>wcwe_topic_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p># Pass the plot explicitly rather than relying on last_plot().<br />
ggsave( &quot;output/wcwe_topic_plot_13.pdf&quot;, wcwe_topic_plot, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<p>As the plot above suggests, topic modelling is a tool to support our limited human understanding rather than a cold, mechanical source of objectivity; as with much unsupervised machine learning, there are various subjective choices that must be made, guided by the intended purpose of the analysis. Drawing together imperceptible threads of relation in bodies of text, the approach suggests major themes and, crucially, can associate disparate areas of text that focus on similar concerns.</p>
<h1>Topical Remedies</h1>
<p>What, then, can we learn by bringing the oppressive weight of latent Dirichlet allocation to bear against a cryptic tome whose words, and indeed letters, resist our best efforts at interpretation?</p>
<p>Without understanding of individual words, we will be unable to glean the semantic understanding of topics that was possible with Murray&#8217;s <em>Witch Cult&#8230;</em>. There is a chance, however, that the topic model can derive relations between separated sections of the manuscript &#8212; do certain early pages demonstrate a particular textual relationship to later pages? Do sections of the overall manuscript retain an apparent <em>coherence</em> of topics, with contiguous pages being drawn from a small range of similar topics? Which Voynich words fall under similar topics?</p>
<h1>Preparations</h1>
<p>Topic modelling typically requires text to undergo a certain level of formulaic preparation. The most common of such rituals are <a href="https://en.wikipedia.org/wiki/Stemming"><em>stemming</em></a>, <a href="https://en.wikipedia.org/wiki/Lemmatisation"><em>lemmatization</em></a>, and <a href="https://en.wikipedia.org/wiki/Stop_words"><em>stopword removal</em></a>. Briefly, stemming and lemmatization aim to reduce confusion by rendering words to their purest essence. Stemming is a more crude heuristic, unsympathetically incising endings, and so truncating <em>&#8220;dark&#8221;</em>, <em>&#8220;darker&#8221;</em>, <em>&#8220;darkest&#8221;</em> simply to the atomic root word <em>&#8220;dark&#8221;</em>. Lemmatization requires more understanding, untangling parts of speech and context: that <em>to curse</em> is a verb while <em>a curse</em> is a noun; the two identical words should therefore be treated separately.</p>
<p>Stopword removal aims to remove the overwhelming proportion of shorter, structural words that are ubiquitous throughout any text, but are largely irrelevant to the overall topic: <em>the, and, were, it, they, but, if&#8230;</em>. Whilst key to our understanding of texts, these terms have no significance to the theme or argument of a text.</p>
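<p>For a known language, these rituals are mercifully mechanical. A minimal illustration, assuming the <code>SnowballC</code> package for Porter stemming and the <code>stop_words</code> lexicon bundled with <code>tidytext</code>, neither of which can avail us against Voynichese:</p>
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( tidytext )	# Supplies the stop_words lexicon.<br />
library( SnowballC )	# Porter stemmer; an assumption of this sketch.</p>
<p>words &lt;- tibble( word=c( &quot;the&quot;, &quot;witches&quot;, &quot;were&quot;, &quot;cursing&quot;, &quot;darkly&quot; ) )</p>
<p># Remove stopwords, then reduce the survivors to their stems.<br />
words %&gt;%<br />
	anti_join( stop_words, by=&quot;word&quot; ) %&gt;%<br />
	mutate( stem=wordStem( word, language=&quot;english&quot; ) )</p>
[/code]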
<p>Undermining our scheme to perform topic modelling, therefore, is the lamentable fact that, without understanding of either the text or its structure, we are largely unable to perform any of these tasks satisfactorily. We have neither an understanding of the grammatical form of Voynich words to allow stemming or lemmatization, nor a list of stopwords to excise.</p>
<p>Whilst stemming and lemmatization are unapproachable, at least within the confines of this post, we can effect a crude form of stopword removal through use of a common frequency analysis of the text. Stopwords are, in general, those words that are both most frequently occurring in some corpus of documents <em>and</em> found across the majority of documents in that language. The second criterion ensures that words occurring frequently in obscure and specialised texts are not considered of undue importance.</p>
<p>This overall statistic is known as <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf"><em>term frequency-inverse document frequency</em></a>, or <a href="https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html"><em>tf-idf</em></a>, and is widely used in information retrieval to identify terms of specific interest within certain documents that are not shared by the wider corpus. For our purposes, we wish to identify and elide those ubiquitous, frequent terms that occur across the entire corpus. To do so, given our lack of knowledge of the structure of the Voynich Manuscript, we will consider each folio as a separate document, and consider only the <em>inverse document frequency</em>, as we are uninterested in how common a word is within each document. To elide those words that appear most commonly across the manuscript, and guided by the lengths of stopword lists in a range of known languages, we therefore remove the 200 words with the lowest inverse document frequency scores<span id='easy-footnote-5-1200' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#easy-footnote-bottom-5-1200' title='The widely-used &lt;a href=&quot;https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip&quot;&gt;NLTK stopword corpus&lt;/a&gt; contains a list of stopwords for 23 world languages, with a notable bias towards European languages. The median length of these stopword lists is 201.5, with values ranging from 53 for Turkish to 1784 for Slovene.'><sup>5</sup></a></span>.</p>
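<p>For reference, the statistic underlying this filter, in the form computed by <code>tidytext</code>&#8217;s <code>bind_tf_idf()</code> with natural logarithms, is:</p>
[code language=&#8221;latex&#8221;]
<p>% N: number of folios; n_t: number of folios containing word t.<br />
\mathrm{idf}(t) = \ln\left( \frac{N}{n_t} \right),<br />
\qquad<br />
\textrm{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)</p>
[/code]
<p>A word occurring on every folio thus scores an inverse document frequency of zero; it is the 200 lowest-scoring of these ubiquitous terms that we discard.</p>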
<p>Having contorted the text into an appropriate form for analysis, we can begin the process of discerning its inner secrets. Our code relies on the <a href="https://www.tidytextmining.com/"><code>tidytext</code></a> and <a href="https://cran.r-project.org/package=stm"><code>stm</code></a> packages, allowing for easy manipulation of document structure and topic models<span id='easy-footnote-6-1200' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#easy-footnote-bottom-6-1200' title='We have also relied extensively on the superb work and writing of &lt;a href=&quot;https://juliasilge.com/blog/sherlock-holmes-stm/&quot;&gt;Silge&lt;/a&gt; on this topic.'><sup>6</sup></a></span>.</p>
<h1>Numerous Interpretations</h1>
<p>Topic models are a cautionary example of recklessly <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised machine learning</a>. As with most such approaches, there are a number of subjective choices to be made that affect the outcome. Perhaps the most influential is the selection of the <em>number</em> of topics that the model should generate. Whilst some approaches have been suggested to derive this number purely by analysis, in most cases it remains in the domain of the human supplicant. Typically, the number of topics is guided both by the structure of the text along with whatever arcane purpose the analysis might have. With our imposed lack of understanding, however, we must rely solely on crude metrics to make this most crucial of choices.</p>
<p>Several methods of assessment exist to quantify the fit of a topic model to the text. The two that we will employ, guided by the <code>stm</code> package are <a href="https://rdrr.io/cran/stm/man/semanticCoherence.html"><em>semantic coherence</em></a>, which roughly expresses that words from a given topic should co-occur within a document; and <a href="https://rdrr.io/cran/stm/man/exclusivity.html"><em>exclusivity</em></a>, which values models more highly when given words occur within topics with high frequency, but are also relatively exclusive to those topics.</p>
<p>We select an optimal number of topics by the simple process of calculating models with varying numbers of topics, and assessing when these two scores are maximised. For the Voynich Manuscript we observe that 34 topics appears to be initially optimal<span id='easy-footnote-7-1200' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/#easy-footnote-bottom-7-1200' title='It should be noted that topic modelling is more typically applied to much larger corpora of text than is possible with our restriction to the Voynich Manuscript. Given the relatively short nature of the text, we might prefer to focus on a smaller number of topics. The metrics plot shows a spike in semantic coherence around 12 topics that might be of interest in future analyses.'><sup>7</sup></a></span>.</p>
<figure id="attachment_1211" aria-describedby="caption-attachment-1211" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png"><img loading="lazy" decoding="async" data-attachment-id="1211" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/voynich_topic_selection_metrics-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_topic_selection_metrics" data-image-description="&lt;p&gt;Selection metrics for Voynich topic model topic numbers.&lt;/p&gt;
" data-image-caption="&lt;p&gt;Selection metrics for Voynich topic model topic numbers.&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png" alt="Voynich Topic Model Selection Metrics" width="1920" height="1080" class="size-full wp-image-1211" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1211" class="wp-caption-text">Selection metrics for Voynich topic model topic numbers. | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_selection_metrics.pdf">PDF Version</a>)</figcaption></figure>
<p>The initial preparation of the text, the search through topic models of varying numbers of topics, and the selection of the final 34-topic model are given in the code below, alongside plotting code for the metrics diagram.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript Topic Modelling Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-model.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( tidytext )<br />
library( widyr )</p>
<p>library( stm )</p>
<p># install_github(&quot;dgrtwo/drlib&quot;)<br />
library( drlib )</p>
<p># References:<br />
# &lt;http://varianceexplained.org/r/op-ed-text-analysis/&gt;<br />
# &lt;https://cbail.github.io/SICSS_Topic_Modeling.html#working-with-meta-data&gt;<br />
# &lt;https://me.eui.eu/andrea-de-angelis/blog/structural-topic-models-to-study-political-text-an-application-to-the-five-star-movements-blog/&gt;<br />
# &lt;https://scholar.princeton.edu/sites/default/files/bstewart/files/stm.pdf&gt;<br />
# &lt;https://juliasilge.com/blog/evaluating-stm/&gt;</p>
<p>voynich_raw &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Read in manually-identified sections per folio, according to<br />
# &lt;http://www.voynich.nu/descr.html#illustr&gt;<br />
voynich_sections &lt;-<br />
	read_csv( &quot;data/voynich_sections.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, section = X2 )</p>
<p># Merge the above to note section for each folio alongside the text<br />
voynich_tbl &lt;-<br />
	left_join( voynich_sections, voynich_raw )</p>
<p># Tokenize<br />
# Remove words of 3 letters or less.<br />
voynich_words &lt;-<br />
	voynich_tbl %&gt;%<br />
	unnest_tokens( word, text ) %&gt;%<br />
	filter( str_length( word ) &gt; 3 )</p>
<p># Most common words<br />
voynich_common &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, sort=TRUE ) %&gt;%<br />
	mutate( word = reorder( word, n ) )</p>
<p># Counts of words per folio<br />
voynich_word_counts &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, folio, sort = TRUE ) </p>
<p># TF-IDF<br />
voynich_tf_idf &lt;-<br />
	voynich_word_counts %&gt;%<br />
	bind_tf_idf( word, folio, n ) %&gt;%<br />
	arrange( desc( tf_idf ) )</p>
<p># Based on median stopword count of languages in NLTK<br />
# (&lt;https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip&gt;),<br />
# remove the 200 lowest-scoring words.<br />
voynich_stopwords &lt;-<br />
	voynich_tf_idf %&gt;%<br />
	arrange( idf  ) %&gt;%<br />
	select( word ) %&gt;%<br />
	unique() %&gt;%<br />
	head( 200 ) %&gt;%<br />
	extract2( &quot;word&quot; )</p>
<p>voynich_words &lt;-<br />
	voynich_words %&gt;%<br />
	filter( !word %in% voynich_stopwords  ) </p>
<p># Generate the corpus<br />
voynich_dfm &lt;-<br />
	voynich_words %&gt;%<br />
	count( folio, word, sort=TRUE ) %&gt;%<br />
	cast_dfm( folio, word, n )</p>
<p># Search for a number of topics and output goodness-of-fit measures. </p>
<p># N=20 is the number of documents &#8216;held out&#8217; for the goodness-of-fit measure.<br />
# (The model is trained on the main body, then used to calculate the<br />
# likelihood of the held-out documents.) N=20 is used here to produce<br />
# approximately 10% of the corpus.<br />
if( not( file.exists( &quot;work/voynich_topic_search_k.rds&quot; ) ) ) {</p>
<p>	message( &quot;Searching low-n topic models&#8230;&quot; )</p>
<p>	voynich_k &lt;-<br />
		searchK( voynich_dfm, K=c(3:40), N=20 )</p>
<p>	saveRDS( voynich_k, &quot;work/voynich_topic_search_k.rds&quot; )</p>
<p>} else {</p>
<p>	voynich_k &lt;-<br />
		readRDS( &quot;work/voynich_topic_search_k.rds&quot; )</p>
<p>}</p>
<p># Based on the metrics above, use 34-topic model<br />
if( not( file.exists( &quot;work/voynich_topic_stm-34.rds&quot; ) ) ) {</p>
<p>	message( &quot;Calculating 34-topic model&#8230;&quot; )</p>
<p>	# (Setting K=0 would instead use Lee and Mimno (2014) to select a number of topics automatically.)<br />
	voynich_topic_model_34 &lt;-<br />
		stm( voynich_dfm, K=34, init.type=&quot;Spectral&quot; )</p>
<p>	# This takes a long time, so save output<br />
	saveRDS( voynich_topic_model_34, &quot;work/voynich_topic_stm-34.rds&quot; )</p>
<p>} else {</p>
<p>	voynich_topic_model_34 &lt;- readRDS( &quot;work/voynich_topic_stm-34.rds&quot; )</p>
<p>}</p>
<p># Based on the metrics above, also calculate a secondary 12-topic model<br />
if( not( file.exists( &quot;work/voynich_topic_stm-12.rds&quot; ) ) ) {</p>
<p>	message( &quot;Calculating 12-topic model&#8230;&quot; )</p>
<p>	# (Setting K=0 would instead use Lee and Mimno (2014) to select a number of topics automatically.)<br />
	voynich_topic_model_12 &lt;-<br />
		stm( voynich_dfm, K=12, init.type=&quot;Spectral&quot; )</p>
<p>	# This takes a long time, so save output<br />
	saveRDS( voynich_topic_model_12, &quot;work/voynich_topic_stm-12.rds&quot; )</p>
<p>} else {</p>
<p>	voynich_topic_model_12 &lt;- readRDS( &quot;work/voynich_topic_stm-12.rds&quot; )</p>
<p>}</p>
<p># Work initially with the 34-topic model<br />
voynich_topic_model &lt;- voynich_topic_model_34</p>
<p>## Convert output to a tidy tibble<br />
voynich_topic_model_tbl &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;beta&quot; )</p>
<p>voynich_terms &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;gamma&quot; )</p>
<p># Select the top six terms in each topic for display<br />
voynich_topics_top &lt;-<br />
	voynich_topic_model_tbl %&gt;%<br />
	group_by(topic) %&gt;%<br />
	top_n(6, beta) %&gt;%<br />
	ungroup() %&gt;%<br />
	arrange(topic, -beta)</p>
<p># Produce a per-folio topic identification.<br />
# Document &#8211; topic &#8211; score<br />
topic_identity &lt;-<br />
	voynich_terms %&gt;%<br />
	group_by( document ) %&gt;%<br />
	top_n( 1, gamma ) %&gt;%<br />
	arrange( document ) %&gt;%<br />
	ungroup</p>
<p># Reinsert manually-identified section information (derived from<br />
# illustrations).<br />
topic_identity$section &lt;- voynich_tbl$section </p>
<p>saveRDS( topic_identity, &quot;work/topic_identity.rds&quot; )</p>
[/code]
</div></div>
</div>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Topic Metric Selection Plotting Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-plot_metric.r</code><br />
[code language=&quot;r&quot;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( ggplot2 )<br />
library( cowplot )</p>
<p># References:<br />
# &lt;http://varianceexplained.org/r/op-ed-text-analysis/&gt;<br />
# &lt;https://cbail.github.io/SICSS_Topic_Modeling.html#working-with-meta-data&gt;<br />
# &lt;https://me.eui.eu/andrea-de-angelis/blog/structural-topic-models-to-study-political-text-an-application-to-the-five-star-movements-blog/&gt;<br />
# &lt;https://scholar.princeton.edu/sites/default/files/bstewart/files/stm.pdf&gt;<br />
# &lt;https://juliasilge.com/blog/evaluating-stm/&gt;</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Read topic model search<br />
voynich_k &lt;-<br />
	readRDS( &quot;work/voynich_topic_search_k.rds&quot; )</p>
<p># Plot semantic coherence against exclusivity for model selection<br />
voynich_k_plot &lt;-<br />
	voynich_k$results %&gt;%<br />
	as_tibble %&gt;%<br />
	rename( &quot;Semantic Coherence&quot;=semcoh, &quot;Exclusivity&quot;=exclus ) %&gt;%<br />
	gather( key=&quot;variable&quot;, value=&quot;value&quot;, &quot;Semantic Coherence&quot;, &quot;Exclusivity&quot; ) %&gt;%<br />
	rename( &quot;Topic Count&quot;=K, &quot;Value&quot;=value )</p>
<p># We will use cowplot, so set the theme here.<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Plot semantic coherence against exclusivity.<br />
# (Highlight the selected 34-topic point.)<br />
voynich_k_semcoh_exclusive &lt;-<br />
	ggplot( voynich_k_plot, aes( x=`Topic Count`, y=Value, group=variable) ) +<br />
	geom_line( colour=&quot;#8a0707&quot; ) +<br />
	facet_wrap( ~variable, ncol=1, scales=&quot;free_y&quot; ) +<br />
	geom_vline( xintercept=34, colour=&quot;#228b22&quot;, linetype=&quot;longdash&quot; ) +<br />
	scale_x_continuous(breaks=c( seq(0, 40, 10), 34 ) ) +<br />
	theme (<br />
			 axis.title.y = element_text( angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
	)  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Topic Selection Metrics&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: http://www.voynich.nu&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, voynich_k_semcoh_exclusive, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_topic_selection_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p># Pass the plot explicitly; ggsave() otherwise saves the last-displayed plot.<br />
ggsave( &quot;output/voynich_topic_selection_metrics.pdf&quot;, voynich_topic_selection_plot, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<p>With these torturous steps finally trodden, our path leads at last to a model of the underlying word-generating probabilities of the Voynich Manuscript. In each facet below, the highest-probability words in each topic are shown in order.</p>
<figure id="attachment_1215" aria-describedby="caption-attachment-1215" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png"><img loading="lazy" decoding="async" data-attachment-id="1215" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/voynich_topic_plot_34-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Manuscript Topic Model" data-image-description="&lt;p&gt;Voynich Manuscript Topic Model (34 topics)&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript Topic Model (34 topics)&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png" alt="Voynich Manuscript Topic Model" width="1920" height="1080" class="size-full wp-image-1215" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1215" class="wp-caption-text">Voynich Manuscript Topic Model (34 topics) | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_topic_plot_34.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Topic Model Plotting Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-plot_topics.r</code><br />
[code language=&quot;r&quot;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )</p>
<p># install_github(&quot;dgrtwo/drlib&quot;)<br />
library( drlib )</p>
<p>library( ggplot2 )<br />
library( cowplot )</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># ####################<br />
# ## 34 Topic Model ##<br />
# ####################</p>
<p># Work with the 34-topic model<br />
voynich_topic_model &lt;-<br />
	readRDS( &quot;work/voynich_topic_stm-34.rds&quot; )</p>
<p>## Convert output to a tidy tibble<br />
voynich_topic_model_tbl &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;beta&quot; )</p>
<p>voynich_terms &lt;-<br />
	tidy(voynich_topic_model, matrix = &quot;gamma&quot; )</p>
<p># Select the top six terms in each topic for display<br />
voynich_topics_top &lt;-<br />
	voynich_topic_model_tbl %&gt;%<br />
	group_by(topic) %&gt;%<br />
	top_n(6, beta) %&gt;%<br />
	ungroup() %&gt;%<br />
	arrange(topic, -beta)</p>
<p># We will use cowplot, so set the theme here.<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Plot each topic as a geom_col(). Use drlib&#8217;s &#8216;reorder_within&#8217; to order bars<br />
# within each facet. (Note that the scale_x_reordered() is needed to fix<br />
# (flipped!) x-axis labels in the output.)<br />
# &lt;https://juliasilge.com/blog/reorder-within/&gt;<br />
gp &lt;-<br />
	voynich_topics_top %&gt;%<br />
	mutate(term = reorder_within(term, beta, topic)) %&gt;%<br />
	ggplot(aes(term, beta, fill = factor(topic))) +<br />
	geom_col( alpha=0.8, show.legend = FALSE) +<br />
	theme( axis.text.y = element_text( family=&quot;voynich_font&quot;, size=10 ) ) +<br />
	facet_wrap(~ topic, scales = &quot;free&quot;) +<br />
	scale_x_reordered() +<br />
	coord_flip() +<br />
	labs( x=&quot;Term&quot;, y=&quot;Probability in Topic&quot; )</p>
<p># Theming<br />
gp &lt;-<br />
	gp +<br />
	theme (<br />
			 axis.title.y = element_text( margin = margin(t = 0, r = 12, b = 0, l = 0), angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( margin = margin(t = 12, r = 0, b = 0, l = 0), colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=6 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
			 ) </p>
<p>gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			legend.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;)<br />
	)</p>
<p># Palette of ink colours (based on screenshots of Diamine inks).<br />
ink_colours &lt;- c( &quot;#753733&quot;, &quot;#b6091d&quot;, &quot;#e45025&quot;, &quot;#232d1d&quot;,<br />
					  	&quot;#224255&quot;, &quot;#533f50&quot;, &quot;#453437&quot;, &quot;#7f2430&quot;,<br />
						&quot;#254673&quot;, &quot;#52120e&quot;, &quot;#3d2535&quot;, &quot;#25464b&quot;,<br />
						&quot;#2f2a1c&quot; )</p>
<p># Create a vector of selections from the palette, one for each topic.<br />
ink_palette &lt;-<br />
	sample( ink_colours, size=34, replace=TRUE )</p>
<p># Add fill colours to plot.<br />
gp &lt;-<br />
	gp + scale_fill_manual( values=ink_palette )</p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Topic Model (34 Topics)&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: http://www.voynich.nu&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, gp, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_topic_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p># Pass the plot explicitly; ggsave() otherwise saves the last-displayed plot.<br />
ggsave( &quot;output/voynich_topic_plot_34.pdf&quot;, voynich_topic_plot, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<h1>Of Man and Machine</h1>
<p>The topic model produces a set of topics, each a probability distribution over the words it generates. Each folio of the Voynich Manuscript is then assigned topics probabilistically, based solely on the distribution of words in its text. There is a secondary topic identification, however, tentatively proposed by scholars of the manuscript: the obscure diagrams decorating almost every folio provide their own startling implications as to the themes detailed in the undeciphered prose.</p>
<p>We might wish to ask, then: do the topic assignments generated by the machine reflect the human interpretation? To what extent do pages decorated with herbal illuminations follow certain machine-identified topics, compared with those assigned to astronomical charts?</p>
<p>The illustration-based thematic sections of the Voynich Manuscript fall into eight broad categories, according to <a href="http://www.voynich.nu/descr.html#illustr">Zandbergen</a>. These sections are, briefly:</p>
<ul>
<li>herbal, detailing a range of unidentified plants, comprising most of the first half of the manuscript;</li>
<li>astronomical, focusing on stars, planets, and astronomical symbols;</li>
<li>cosmological, displaying obscure circular diagrams of a similar form to the astronomical;</li>
<li>astrological, in which small humans are displayed mostly in circular diagrams alongside zodiac signs;</li>
<li>biological, characterised by small drawings of human figures, often connected by tubes;</li>
<li>pharmaceutical, detailing parts of plants and vessels for their preparation;</li>
<li>starred text, divided into short paragraphs marked with a star, with no other illustrations; and</li>
<li>text only pages.</li>
</ul>
<p>With these contextual descriptions, we can examine the relationship between the speculative assignments of the topic model and the suggestions of the diagrams.</p>
<figure id="attachment_1219" aria-describedby="caption-attachment-1219" style="width: 1920px" class="wp-caption aligncenter"><a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png"><img loading="lazy" decoding="async" data-attachment-id="1219" data-permalink="https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/voynich_folio_topic_heatmap-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Manuscript Folio Topic Heatmap" data-image-description="&lt;p&gt;Voynich Manuscript Folio Topic Heatmap&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript Folio Topic Heatmap&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-1024x576.png" src="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png" alt="Voynich Manuscript Folio Topic Heatmap" width="1920" height="1080" class="size-full wp-image-1219" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-1536x864.png 1536w, https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap-678x381.png 678w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-1219" class="wp-caption-text">Voynich Manuscript Folio Topic Heatmap | (<a href="https://www.weirddatascience.net/wp-content/uploads/2019/12/voynich_folio_topic_heatmap.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Manuscript Topic Model Folio Heatmap Plotting Code</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_topics-plot_heatmap.r</code><br />
[code language=&quot;r&quot;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p># install_github(&quot;dgrtwo/drlib&quot;)<br />
library( drlib )</p>
<p>library( ggplot2 )<br />
library( cowplot )</p>
<p>font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p># Set the number of topics<br />
n_topics &lt;- 34</p>
<p># Load the appropriate topic model<br />
voynich_topic_model &lt;-<br />
	readRDS( paste0( &quot;work/voynich_topic_stm-&quot;, n_topics, &quot;.rds&quot; ))</p>
<p>theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Load folio topic identity assignments<br />
topic_identity &lt;-<br />
	readRDS( &quot;work/topic_identity.rds&quot; )</p>
<p># Plot topic heatmap<br />
topic_heatmap &lt;-<br />
	topic_identity %&gt;%<br />
	ggplot( aes( x=document, y=topic, fill=section ) ) +<br />
	geom_tile( colour=&quot;#3c3f4a&quot;, alpha=0.8, size=0.4 ) +<br />
	scale_fill_brewer( palette=&quot;Dark2&quot;, direction=1, name=&quot;Section&quot;, labels=c(&quot;Astrological&quot;, &quot;Astronomical&quot;, &quot;Biological&quot;, &quot;Cosmological&quot;, &quot;Herbal&quot;, &quot;Pharmaceutical&quot;, &quot;Starred Text&quot;, &quot;Text Only&quot; ) ) +<br />
	ggtitle( &quot;Voynich Folio Topic Assignments&quot;, paste( n_topics, &quot;Topic Model&quot; )) +<br />
	labs( x=&quot;Folio&quot;, y=&quot;Topic&quot; ) +<br />
	theme (<br />
			 plot.title = element_text( family=&quot;bold_font&quot;, size=22 ),<br />
			 plot.subtitle = element_text( family=&quot;bold_font&quot;, size=12 ),<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 ) +<br />
	scale_y_continuous(labels = seq( 1, n_topics, 1 ), breaks = seq( 1, n_topics, 1 ), minor_breaks = seq(0.5 , n_topics+.5, 1) ) +<br />
	scale_x_continuous(minor_breaks = seq(0.5 , 226.5, 5) ) </p>
<p>gp &lt;-<br />
	topic_heatmap +<br />
	theme (<br />
			 axis.title.y = element_text( margin = margin(t = 0, r = 12, b = 0, l = 0), angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=10 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title =element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 strip.background = element_blank(),<br />
			 strip.text.x = element_text(size = 10, family=&quot;main_font&quot;)<br />
			 ) </p>
<p>gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			legend.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;)<br />
	)</p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Topic Heatmap&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data: http://www.voynich.nu&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=8, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, gp, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_topic_heatmap &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p># Pass the plot explicitly; ggsave() otherwise saves the last-displayed plot.<br />
ggsave( &quot;output/voynich_folio_topic_heatmap.pdf&quot;, voynich_topic_heatmap, width=16, height=9 )</p>
[/code]
</div></div>
</div>
<p>The colours in the above plot represent the manual human interpretation, whilst the location on the y-axis shows the topic assignment from the structural topic model.</p>
<p>We might have harboured the fragile hope that such a diagram would have demonstrated a clear delineation between the machine-identified topics and the illustration-based sections of the Voynich Manuscript. At a first inspection, however, the topics identified by the analysis appear almost uniformly distributed across the pages of the manuscript.</p>
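<p>One way to press beyond visual inspection is to cross-tabulate the machine-assigned topics against the human-assigned sections and test for independence. The following is a minimal sketch rather than part of the analysis proper: it assumes the <code>topic_identity</code> tibble saved by the modelling code above, and simulates the p-value since, with 34 topics and eight sections, the contingency table is sparse.</p>
[code language=&quot;r&quot;]
# Sketch only: quantify the association between the machine-assigned
# topics and the human-assigned sections. Assumes work/topic_identity.rds
# as saved by the topic modelling script above.
topic_identity &lt;- readRDS( &quot;work/topic_identity.rds&quot; )

# Contingency table of topic assignment against illustration-based section.
topic_section_table &lt;-
	table( topic_identity$topic, topic_identity$section )

# Simulate the p-value: many cells of the 34x8 table are sparse, so the
# asymptotic chi-squared approximation is unreliable.
chisq.test( topic_section_table, simulate.p.value=TRUE, B=10000 )
[/code]
<p>A small simulated p-value would hint that the topics are not, after all, blind to the illustrated sections; a large one is consistent with the uniform scattering apparent in the heatmap. Either way, the sparseness of the table counsels caution.</p>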
<p>The topic model rests on a number of assumptions, from the selection of stopwords through to the number of topics in the model. We must also be cautious: the apparent distribution of topics over the various sections may be deceptive. For the moment, we can present this initial topic model as a faltering first step in our descent into the hidden structures of the Voynich Manuscript. The next, and final, post in this series will develop both the statistical features and the topic model towards a firmer understanding of whether the apparent shift in theme suggested by the illustrations is statistically supported by the text.</p>
<p>Until then, read deeply but do not trust what you read.</p>
<hr />
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/12/24/illuminating-the-illuminated-part-three-topics-of-invention-topic-modelling-the-voynich-manuscript/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1200</post-id>	</item>
		<item>
		<title>Illuminating the Illuminated &#8211; Part Two: Ipsa Scientia Potestas Est &#124; Power Laws in the Voynich Manuscript</title>
		<link>https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/</link>
					<comments>https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Mon, 28 Oct 2019 09:42:02 +0000</pubDate>
				<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[cryptology]]></category>
		<guid isPermaLink="false">http://www.weirddatascience.net/?p=945</guid>

					<description><![CDATA[<div class="mh-excerpt">In the <a href="https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/">previous post</a> in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities. Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/" title="Illuminating the Illuminated &#8211; Part Two: Ipsa Scientia Potestas Est &#124; Power Laws in the Voynich Manuscript">[...]</a>]]></description>
										<content:encoded><![CDATA[<p>In the <a href="https://www.weirddatascience.net/index.php/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/">previous post</a> in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities.</p>
<p>Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.</p>
<p>One of the more well-known arguments for the authenticity of the manuscript, in addition to its manufacture with period parchment and inks, is that the text appears to follow certain statistical properties associated with human language, properties which were unknown at the time of its creation.</p>
<p>The most well-known of these properties is that the word frequencies in the Voynich Manuscript have been claimed to follow a phenomenon known as <a href="https://en.wikipedia.org/wiki/Zipf%27s_law"><em>Zipf&#8217;s Law</em></a>, whereby the frequency of a word&#8217;s occurrence in the text is inversely proportional to its rank in the list of words ordered by frequency.</p>
<p>In this post, we will scrutinise the extent to which the expected statistical properties of natural languages hold for the arcane glyphs presented by the Voynich manuscript.</p>
<h1>Unnatural Laws</h1>
<p>Zipf&#8217;s Law is an example of a discrete <a href="https://en.wikipedia.org/wiki/Power_law">power law probability distribution</a>. Power laws have been found to lurk beneath a sinister variety of ostensibly natural phenomena, from the relative size of human settlements to the diversity of species descended from a particular ancestral freshwater fish.</p>
<p>In its original context of human language, Zipf&#8217;s Law states that the most common word in a given language is likely to be roughly twice as common as the second most common word, and three times as common as the third most common word. More precisely, this law holds <em>for much of the corpus</em>, as the law tends to break down somewhat at both the most-frequent and least-frequent words in the corpus<span id='easy-footnote-8-945' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/#easy-footnote-bottom-8-945' title='The distributions observed in natural languages are therefore best described as following several parameterisations of the slightly more complex, generalised &lt;a href=&quot;https://en.wikipedia.org/wiki/Zipf%E2%80%93Mandelbrot_law&quot;&gt;Zipf–Mandelbrot distribution&lt;/a&gt;, with different parameters for the most common, middle, and least-common segments of the corpus.'><sup>8</sup></a></span>. Despite this, we will focus on the original, simpler Zipfian characterisation in this analysis.</p>
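<p>As a toy illustration, unconnected to the manuscript itself, the ideal Zipfian frequencies for the first few ranks can be computed directly: the word of rank <em>r</em> occurs with frequency proportional to 1/<em>r</em>.</p>
[code language=&quot;r&quot;]
# Illustrative sketch of an ideal Zipfian distribution over a notional
# vocabulary of 10,000 word types: frequency is proportional to 1/rank.
zipf_freq &lt;- function( rank, s=1 ) ( 1 / rank^s ) / sum( 1 / (1:10000)^s )

# Relative frequencies of the five most common words.
zipf_freq( 1:5 )

# The most common word is 2, 3, 4, and 5 times as frequent as the words
# of ranks 2 to 5, as Zipf&#8217;s Law dictates.
zipf_freq( 1 ) / zipf_freq( 2:5 )
[/code]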
<p>The most well-known, if highly flawed, method to determine whether a distribution follows a power law is to plot it with both axes expressed as a log-scale: a so-called <a href="https://en.wikipedia.org/wiki/Log-log">log-log plot</a>. A power law, represented in such a way, will appear linear. Unfortunately, <a href="https://web.archive.org/web/20190718122537/http://bactra.org/weblog/491.html">a hideous menagerie of other distributions will also appear linear in such a setting</a>.</p>
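<p>To make the danger concrete, the following small simulation, an illustrative sketch with no connection to the manuscript data, plots the empirical tail of a lognormal distribution: not a power law, yet capable of tracing a deceptively straight line across much of a log-log plot.</p>
[code language=&quot;r&quot;]
# Sketch only: the empirical complementary CDF of lognormal samples can
# appear near-linear on log-log axes over several decades.
set.seed( 1 )

x &lt;- sort( rlnorm( 10000, meanlog=0, sdlog=2.5 ) )

# P( X &gt;= x ) for each sorted observation.
ccdf &lt;- ( length( x ) - seq_along( x ) + 1 ) / length( x )

plot( log( x ), log( ccdf ), xlab=&quot;Log Value&quot;, ylab=&quot;Log CCDF&quot; )
[/code]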
<p>More generally, it is rarely sensible to claim that any natural phenomenon <em>follows</em> a given distribution or model; we can at best demonstrate that a distribution presents <a href="https://en.wikipedia.org/wiki/All_models_are_wrong"><em>a useful model</em></a> for a given set of observations. Indeed, it is possible to fit a power law to any set of observations, with the caveat that the fit may simply be poor. Ultimately, we can do little more than demonstrate that a given model is the best simulacrum of observed reality, subject to the uses to which it will be put. Certainly, a more Bayesian approach would advocate fitting a range of candidate models and demonstrating that the power law describes the observations at least as well as its rivals. All truth, it seems, is relative.</p>
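<p>The poweRlaw package introduced below supports exactly this kind of relative comparison. As a sketch, assuming the <code>voynich_common</code> word counts constructed in the power-law script further down, a discrete power law and a discrete lognormal can be fitted to the same counts and compared with Vuong&#8217;s likelihood-ratio test; a frequentist stand-in for the fuller model comparison described above.</p>
[code language=&quot;r&quot;]
library( poweRlaw )

# Sketch only: compare a discrete power law against a discrete lognormal
# on the same word counts (voynich_common$n, as built in the script
# below), following the pattern of the poweRlaw vignette.
word_counts &lt;- voynich_common$n

pl &lt;- displ$new( word_counts )
pl$setXmin( estimate_xmin( pl ) )

# Both models must share the same xmin for the comparison to be valid.
ln &lt;- dislnorm$new( word_counts )
ln$setXmin( pl$getXmin() )
ln$setPars( estimate_pars( ln ) )

# Vuong&#8217;s test: a large two-sided p-value means that neither model is
# clearly preferred over the other.
compare_distributions( pl, ln )
[/code]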
<p>Faced with the awful statistical horror of the universe, we are reduced to seeking evidence <em>against</em> a phenomenon&#8217;s adherence to a given distribution. Our first examination, then, is to see whether the basic log-log plot supports or undermines a Zipfian reading of the Voynich Manuscript.</p>
<figure id="attachment_953" aria-describedby="caption-attachment-953" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_power_law.png"><img loading="lazy" decoding="async" data-attachment-id="953" data-permalink="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/voynich_power_law-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_power_law.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Fitted Power Law of Voynich Corpus" data-image-description="&lt;p&gt;Fitted Power Law of Voynich Corpus&lt;/p&gt;
" data-image-caption="&lt;p&gt;Fitted Power Law of Voynich Corpus&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_power_law.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_power_law.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_power_law.png" alt="Fitted Power Law of Voynich Corpus" width="1920" height="1080" class="size-full wp-image-953" /></a><figcaption id="caption-attachment-953" class="wp-caption-text">Fitted Power Law of Voynich Corpus | (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_power_law.pdf">PDF Version</a>)</figcaption></figure>
<p>A crude visual analysis certainly supports the argument that, for much of the upper half of the Voynich corpus, there is a linear relationship on the log-log plot consistent with Zipf&#8217;s Law. As mentioned, however, this superficial appeal to our senses leaves a gnawing lack of certainty in the conclusion. We must turn to less fallible tools.</p>
<p>The <a href="https://cran.r-project.org/package=poweRlaw">poweRlaw</a> package for R is designed specifically to exorcise these particular demons. This package attempts to fit a power law distribution to a series of observations, in our case the word frequencies observed in the corpus of Voynich text. With the fitted model, we then attempt to <em>disprove</em> the null hypothesis that the data is drawn from a power law. If this attempt to betray our own model fails, then we attain an inverse enlightenment: there is insufficient evidence that the model is <em>not</em> drawn from a power law.</p>
<p>This is an inversion of the more typical <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5635437/">frequentist null hypothesis</a> scenario. In such approaches we usually hope for a low p-value, typically below 0.05 or even 0.001, indicating that observations as extreme as ours would be highly improbable under the null hypothesis. For this test, we instead hope that our p-value is <em>insufficiently</em> low to make such a claim, and thus that a power law <em>is</em> consistent with the data.</p>
<p>The diagram above shows a fitted parameterisation of the power law according to the poweRlaw package. In addition to the visually appealing fit of the line, the weirdly inverted logic of the above test provides a p-value of <code>0.151</code>. We thus have as much confidence as we can have, via this approach, that a power law is a reasonable model for the text in the Voynich corpus.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Voynich Power Law Fit and Plot</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_powerlaw.r</code><br />
[code language=&quot;r&quot;]
library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )<br />
library( drlib )</p>
<p>library( poweRlaw )</p>
<p>library(cowplot)<br />
library(magick)</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)\</p>
<p>showtext_auto()</p>
<p>message( &quot;Reading raw Voynich data&#8230;&quot; )<br />
voynich_tbl &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Tokenize<br />
voynich_words &lt;-<br />
	voynich_tbl %&gt;%<br />
	unnest_tokens( word, text ) </p>
<p># Most common words<br />
message( &quot;Calculating Voynich language statistics&#8230;&quot; )<br />
voynich_common &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, sort=TRUE ) %&gt;%<br />
	mutate( word = reorder( word, n ) ) %&gt;%<br />
	mutate( freq = n / sum(n) )</p>
<p>voynich_word_counts &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, folio, sort = TRUE ) </p>
<p># (Following the poweRlaw vignette)<br />
# Create a discrete power law distribution object from the word counts<br />
voynich_powerlaw &lt;-<br />
	voynich_common %&gt;%<br />
	extract2( &quot;n&quot; ) %&gt;%<br />
	displ$new()</p>
<p># Estimate the lower bound<br />
voynich_powerlaw_xmin &lt;-<br />
	estimate_xmin( voynich_powerlaw )</p>
<p># Set the parameters of the voynich_powerlaw to the estimated values<br />
voynich_powerlaw$setXmin( voynich_powerlaw_xmin )</p>
<p># Estimate parameters of the power law distribution<br />
voynich_powerlaw_est &lt;-<br />
	estimate_pars( voynich_powerlaw )</p>
<p># Calculate p-value of power law. See Section 4.2 of &quot;Power-Law Distributions in Empirical Data&quot; by Clauset et al.<br />
# If the p-value is _greater_ than 0.1 then we cannot rule out a power-law distribution.<br />
voynich_powerlaw_bootstrap_p &lt;-<br />
	bootstrap_p(voynich_powerlaw, no_of_sims=1000, threads=7 )<br />
# p=0.143 power law cannot be ruled out</p>
<p># Parameter uncertainty via bootstrapping<br />
voynich_powerlaw_bootstrap &lt;-<br />
	bootstrap( voynich_powerlaw, no_of_sims=1000, threads=7 )</p>
<p># Plot data and power law fit<br />
voynich_powerlaw_plot_data &lt;-<br />
	plot( voynich_powerlaw, draw = F ) %&gt;%<br />
	mutate(<br />
			 log_x = log( x ),<br />
			 log_y = log( y ) )</p>
<p>voynich_powerlaw_fit_data &lt;-<br />
	lines( voynich_powerlaw, col=2, draw = F ) %&gt;%<br />
	mutate(<br />
			 log_x = log( x ),<br />
			 log_y = log( y )<br />
	)</p>
<p># Plot the fitted power law data.<br />
gp &lt;-<br />
	ggplot( voynich_powerlaw_plot_data ) +<br />
	geom_point( aes( x = log( x ), y =log( y ) ), colour=&quot;#8a0707&quot; ) +<br />
	geom_line( data= voynich_powerlaw_fit_data,<br />
				 aes(<br />
					  x = log( x ),<br />
					  y = log( y ) ), colour=&quot;#0b6788&quot;) +<br />
	labs(<br />
		  x = &quot;Log Rank&quot;,<br />
		  y = &quot;Log Frequency&quot; )</p>
<p>gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.title = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=22 ),<br />
			plot.subtitle = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.title.x = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.title.y = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.line = element_line( colour=&quot;#3c3f4a&quot; ),<br />
			panel.grid.major.x = element_blank(),<br />
			panel.grid.major.y = element_blank(),<br />
			panel.grid.minor.x = element_blank(),<br />
			panel.grid.minor.y = element_blank()<br />
			) </p>
<p># Remove legend from internal plot<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(<br />
				  &quot;Fitted Power Law of Voynich Corpus&quot;,<br />
				  fontfamily=&quot;bold_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=20,<br />
				  hjust=0, vjust=1,<br />
				  x=0.02, y=0.88 ) +<br />
	draw_label(<br />
				  &quot;http://www.weirddatascience.net | @WeirdDataSci&quot;,<br />
				  fontfamily=&quot;main_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=12,<br />
				  hjust=0, vjust=1,<br />
				  x=0.02, y=0.40 )</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(<br />
				  &quot;Data: http://www.voynich.nu&quot;,<br />
				  fontfamily=&quot;main_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=14, hjust=1,<br />
				  x=0.98 ) </p>
<p># Combine plots<br />
tgp &lt;-<br />
	plot_grid(<br />
				 title,<br />
				 gp,<br />
				 data_label,<br />
				 ncol=1,<br />
				 rel_heights=c( 0.1, 1, 0.1 ) ) </p>
<p># Add parchment underlay<br />
parchment_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>save_plot(&quot;output/voynich_power_law.pdf&quot;,<br />
							parchment_plot,<br />
							base_width = 16,<br />
							base_height = 9,<br />
			           	base_aspect_ratio = 1.78 )<br />
[/code]
</div></div>
</div>
<p>Led further down twisting paths by this initial taste of success, we can now set the Voynich corpus against other human-language corpora to gain a faint impression of how closely it resembles known languages. The following plot compares the frequency of words in the Voynich Manuscript to those of the twenty most popular languages in Wikipedia, taken from the dataset available <a href="https://www.datos.gov.co/en/Ciencia-Tecnolog-a-e-Innovaci-n/Multilingual-Wikipedia-2015-word-frequencies-32-la/jmxq-gzgh/data">here</a>.</p>
<figure id="attachment_956" aria-describedby="caption-attachment-956" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_wikipedia_plot.png"><img loading="lazy" decoding="async" data-attachment-id="956" data-permalink="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/voynich_wikipedia_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_wikipedia_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora" data-image-description="&lt;p&gt;Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_wikipedia_plot.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_wikipedia_plot.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_wikipedia_plot.png" alt="Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora" width="1920" height="1080" class="size-full wp-image-956" /></a><figcaption id="caption-attachment-956" class="wp-caption-text">Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora | (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_wikipedia_plot.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Wikipedia Word Frequency Plot</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p>Data:</p>
<ul>
<li>Wikipedia Language Corpora: <a href="https://www.datos.gov.co/en/Ciencia-Tecnolog-a-e-Innovaci-n/Multilingual-Wikipedia-2015-word-frequencies-32-la/jmxq-gzgh/data">https://www.datos.gov.co/en/Ciencia-Tecnolog-a-e-Innovaci-n/Multilingual-Wikipedia-2015-word-frequencies-32-la/jmxq-gzgh/data</a></li>
</ul>
<p><code>voynich_zipf_wikipedia.r</code><br />
[code language=&quot;r&quot;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )<br />
library( drlib )</p>
<p>library(cowplot)<br />
library(magick)</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>message( &quot;Reading raw Voynich data&#8230;&quot; )<br />
voynich_tbl &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Tokenize<br />
# (No word-length filtering here, unlike the topic modelling script.)<br />
# Stemming and stopword removal apparently not so effective anyway,<br />
# according to Schofield et al.: &lt;www.cs.cornell.edu/~xanda/winlp2017.pdf&gt;<br />
voynich_words &lt;-<br />
	voynich_tbl %&gt;%<br />
	unnest_tokens( word, text ) </p>
<p># Most common words<br />
message( &quot;Calculating Voynich language statistics&#8230;&quot; )<br />
voynich_common &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, sort=TRUE ) %&gt;%<br />
	mutate( word = reorder( word, n ) ) %&gt;%<br />
	mutate( freq = n / sum(n) )</p>
<p># Plot a log-log plot of Voynich word frequencies.<br />
voynich_word_counts &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, folio, sort = TRUE ) </p>
<p># Load other languages.<br />
# Select frequency counts.<br />
# Convert to long format, then normalise per-language.<br />
message( &quot;Loading common language statistics&#8230;&quot; )<br />
wiki_language &lt;-<br />
	read.csv( &quot;data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv&quot; ) %&gt;%<br />
	head( 10000 ) %&gt;%<br />
	as_tibble %&gt;%<br />
	select( ends_with( &quot;_FREQ&quot; ) ) %&gt;%<br />
	gather( key = &quot;language&quot;, value = &quot;count&quot; ) %&gt;%<br />
	mutate( language = str_replace( language, &quot;_FREQ&quot;, &quot;&quot; ) ) %&gt;%<br />
	group_by( language ) %&gt;%<br />
	transmute( freq = count / sum( count ) ) %&gt;%<br />
	ungroup</p>
<p>wiki_language_words &lt;-<br />
	read.csv( &quot;data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv&quot; ) %&gt;%<br />
	head( 10000 ) %&gt;%<br />
	as_tibble </p>
<p># Combine with Voynich, assigning it the unassigned ISO 639-1 language code &quot;vy&quot;<br />
message( &quot;Combining common and Voynich language statistics&#8230;&quot; )<br />
voynich_language &lt;-<br />
	voynich_common %&gt;%<br />
	transmute( language = &quot;vy&quot;, freq = freq )</p>
<p># Combine, then add per-language rank information<br />
message( &quot;Processing common and Voynich language statistics&#8230;&quot; )<br />
all_languages &lt;-<br />
	bind_rows( wiki_language, voynich_language ) %&gt;%<br />
	mutate( colour = ifelse( str_detect( `language`, &quot;vy&quot; ), &quot;red&quot;, &quot;grey&quot; ) ) %&gt;%<br />
	group_by( language ) %&gt;%<br />
	transmute( log_rank=log( row_number() ), log_freq=log( freq ), colour ) %&gt;%<br />
	ungroup </p>
<p># Plot a log-log plot of all language word frequencies.<br />
message( &quot;Plotting common and Voynich language statistics&#8230;&quot; )<br />
voynich_wikipedia_plot &lt;-<br />
	all_languages %&gt;%<br />
	ggplot( aes( x=log_rank, y=log_freq, colour=colour) ) +<br />
	geom_point( alpha=0.4, shape=20 ) +<br />
	scale_color_manual( values=c(&quot;#3c3f4a&quot;, &quot;#8a0707&quot; ) ) +<br />
	theme (<br />
			 axis.title.y = element_text( angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ) # bg of the panel<br />
			 ) +<br />
	#scale_colour_viridis_d( option=&quot;cividis&quot;, begin=0.4 ) +<br />
	guides( colour=&quot;none&quot; ) +<br />
	labs( y=&quot;Log Frequency&quot;,<br />
			x=&quot;Log Rank&quot; )</p>
<p>theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;- ggdraw() +<br />
	draw_label(<br />
				  &quot;Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora&quot;,<br />
				  fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20,<br />
				  hjust=0, vjust=1, x=0.02, y=0.88 ) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;,<br />
				  fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12,<br />
				  hjust=0, vjust=1, x=0.02, y=0.40 )</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data from: http://www.voynich.nu | http://wikipedia.org&quot;,<br />
				  fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12,<br />
				  hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, voynich_wikipedia_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_wikipedia_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>save_plot(&quot;output/voynich_wikipedia_plot.pdf&quot;,<br />
							voynich_wikipedia_plot,<br />
							base_width = 16,<br />
							base_height = 9,<br />
			           	base_aspect_ratio = 1.78 )<br />
[/code]
</div></div>
</div>
<p>The Voynich text seems consistent with the behaviour of known natural languages from Wikipedia. The most striking difference is the clustering of Voynich word frequencies in the lower half of the diagram, a consequence of the smaller corpus of words in the Voynich Manuscript. With so few tokens, many lower-frequency words occur an identical number of times, producing the vertical steps towards the lower end of the frequency graph.</p>
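<p>The stepping is easy to reproduce with a toy corpus, an illustrative sketch with no connection to the manuscript: sampling only a few thousand tokens from a Zipfian vocabulary leaves many rare word types tied on identical counts.</p>
[code language=&quot;r&quot;]
# Sketch only: small samples from a Zipfian vocabulary produce tied
# counts among rare types, and hence stepped rank-frequency plots.
set.seed( 1 )

toy_counts &lt;-
	sort( table( sample( 1:1000, 2000, replace=TRUE, prob=1/(1:1000) ) ),
			decreasing=TRUE )

plot( log( seq_along( toy_counts ) ), log( as.numeric( toy_counts ) ),
		xlab=&quot;Log Rank&quot;, ylab=&quot;Log Frequency&quot; )
[/code]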
<p>To highlight this phenomenon, we can apply a similar technique to another widely-translated short text: the Universal Declaration of Human Rights.</p>
<figure id="attachment_954" aria-describedby="caption-attachment-954" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_udhr_plot.png"><img loading="lazy" decoding="async" data-attachment-id="954" data-permalink="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/voynich_udhr_plot/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_udhr_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Manuscript Rank Frequency Distribution against UNDHR Translations" data-image-description="&lt;p&gt;Voynich Manuscript Rank Frequency Distribution against UNDHR Translations&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript Rank Frequency Distribution against UNDHR Translations&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_udhr_plot.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_udhr_plot.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_udhr_plot.png" alt="Voynich Manuscript Rank Frequency Distribution against UNDHR Translations" width="1920" height="1080" class="size-full wp-image-954" /></a><figcaption id="caption-attachment-954" class="wp-caption-text">Voynich Manuscript Rank Frequency Distribution against UNDHR Translations | (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_udhr_plot.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>UNDHR Word Frequency Plot</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_zipf_udhr.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )</p>
<p>library(cowplot)<br />
library(magick)</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)</p>
<p>showtext_auto()</p>
<p>message( &quot;Reading raw Voynich data&#8230;&quot; )<br />
voynich_tbl &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Tokenize<br />
# (Remove words of 3 letters or less)<br />
# Stemming and stopword removal apparently not so effective anyway,<br />
# according to Schofield et al.: &lt;www.cs.cornell.edu/~xanda/winlp2017.pdf&gt;<br />
voynich_words &lt;-<br />
	voynich_tbl %&gt;%<br />
	unnest_tokens( word, text ) </p>
<p># Most common words<br />
message( &quot;Calculating Voynich language statistics&#8230;&quot; )<br />
voynich_common &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, sort=TRUE ) %&gt;%<br />
	mutate( word = reorder( word, n ) ) %&gt;%<br />
	mutate( freq = n / sum(n) )</p>
<p># Combine with Voynich, assigning it the unassigned ISO 3166-1 alpha-2 code &quot;vy&quot;<br />
message( &quot;Combining common and Voynich language statistics&#8230;&quot; )<br />
voynich_language &lt;-<br />
	voynich_common %&gt;%<br />
	transmute( language = &quot;vy&quot;, freq = freq )</p>
<p>voynich_word_counts &lt;-<br />
	voynich_words %&gt;%<br />
	count( word, folio, sort = TRUE ) </p>
<p># UDHR corpus comparison (smaller text)<br />
udhr_corpus_files &lt;- list.files(&quot;data/udhr/udhr_txt&quot;, pattern=&quot;*.txt&quot;, full.names=TRUE )</p>
<p># Helper function to read in a text file and calculate a frequency tablle<br />
table_frequency_mapper &lt;- function( x ) {</p>
<p>	# Read file and extract language code from filename<br />
	udhr_text &lt;- read_lines( x, skip=6, skip_empty_rows=TRUE )<br />
	language &lt;- basename( x ) %&gt;% str_replace( &quot;udhr_&quot;, &quot;&quot; ) %&gt;% str_replace( &quot;.txt&quot;, &quot;&quot; ) </p>
<p>	# Tokenize and remove punctuation<br />
	udhr_words &lt;-<br />
		udhr_text %&gt;%<br />
		str_flatten %&gt;%<br />
		str_remove_all( &quot;[.,]&quot; ) %&gt;%<br />
		str_split( &quot;\\s+&quot; ) %&gt;%<br />
		extract2( 1 ) %&gt;%<br />
		{ tibble( word=. ) }</p>
<p>	# Most common words<br />
	udhr_common &lt;-<br />
		udhr_words %&gt;%<br />
		count( word, sort=TRUE ) %&gt;%<br />
		mutate( word = reorder( word, n ), language )</p>
<p>}</p>
<p>voynich_corpus &lt;-<br />
	voynich_language %&gt;%<br />
	transmute( language, log_rank=log( row_number() ), log_freq=log( freq ), colour=&quot;Voynich Text&quot; )</p>
<p>udhr_corpus &lt;-<br />
	udhr_corpus_files %&gt;%<br />
	map( table_frequency_mapper ) %&gt;%<br />
	bind_rows %&gt;%<br />
	group_by( language ) %&gt;%<br />
	transmute( log_rank=log( row_number() ), log_freq=log( n / sum(n) ), colour=&quot;Known UDHR Language&quot; ) %&gt;%<br />
	ungroup</p>
<p>voynich_udhr_corpus &lt;-<br />
	bind_rows( udhr_corpus, voynich_corpus )</p>
<p>voynich_udhr_frequency_plot &lt;-<br />
	voynich_udhr_corpus %&gt;%<br />
	ggplot( aes( x=log_rank, y=log_freq, colour=colour) ) +<br />
	geom_point( alpha=0.4, shape=19 ) +<br />
	scale_color_manual( values=c( &quot;Known UDHR Language&quot; = &quot;#3c3f4a&quot;, &quot;Voynich Text&quot; = &quot;#8a0707&quot; ) ) +<br />
	theme (<br />
			 axis.title.y = element_text( angle = 90, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.y = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.title.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.text.x = element_text( colour=&quot;#3c3f4a&quot;, family=&quot;main_font&quot;, size=12 ),<br />
			 axis.line.x = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 axis.line.y = element_line( color = &quot;#3c3f4a&quot; ),<br />
			 plot.title = element_blank(),<br />
			 plot.subtitle = element_blank(),<br />
			 plot.background = element_rect( fill = &quot;transparent&quot; ),<br />
			 panel.background = element_rect( fill = &quot;transparent&quot; ), # bg of the panel<br />
			 panel.grid.major.x = element_blank(),<br />
			 panel.grid.major.y = element_blank(),<br />
			 panel.grid.minor.x = element_blank(),<br />
			 panel.grid.minor.y = element_blank(),<br />
			 legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			 legend.title = element_blank(),<br />
			 legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			 legend.position=c(.85,.5)<br />
			 ) +<br />
	labs( y=&quot;Log Frequency&quot;,<br />
			x=&quot;Log Rank&quot; )</p>
<p>theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Voynich Manuscript Rank Frequency Distribution against UNDHR Translations&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40)</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Data from: http://www.voynich.nu | http://unicode.org/udhr/&quot;, fontfamily=&quot;bold_font&quot;, colour = &quot;#3c3f4a&quot;, size=12, hjust=1, x=0.98 )</p>
<p>tgp &lt;-<br />
	plot_grid(title, voynich_udhr_frequency_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) </p>
<p>voynich_udhr_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>save_plot(&quot;output/voynich_udhr_plot.pdf&quot;,<br />
							voynich_udhr_plot,<br />
							base_width = 16,<br />
							base_height = 9,<br />
			           	base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<h1>A Refined Randomness</h1>
<p>The above arguments might at first appear compelling. The surface incomprehensibility of the Voynich Manuscript succumbs to the deep currents of statistical laws, and reveals an underlying pattern amongst the chaos of the text.</p>
<p>Sadly, however, as with all too many arguments in the literature regarding power law distributions arising in nature, there is a complication to this argument that again highlights the difference between proof and the failure to disprove. Certainly, if a power law had proved incompatible with the Voynich Manuscript then we would have doubted its authenticity. With its apparent adherence to such a distribution, however, we have taken only one hesitant step towards confidence.</p>
<p><a href="https://www.tandfonline.com/doi/abs/10.1080/0161-110491892755">Rugg</a> has argued that certain random mechanisms can produce text that adheres to Zipf&#8217;s Law, and has demonstrated a simple mechanical procedure for doing so. A more compelling argument is presented, without reference to the Voynich Manuscript, by <a href="https://doi.org/10.1109/18.165464">Li. (1992)</a><span id='easy-footnote-9-945' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/#easy-footnote-bottom-9-945' title='Li, W. ‘Random Texts Exhibit Zipf’s-Law-like Word Frequency Distribution’. IEEE Transactions on Information Theory 38, no. 6 (November 1992): 1842–45. &lt;a href=&quot;https://doi.org/10.1109/18.165464&quot;&gt;https://doi.org/10.1109/18.165464&lt;/a&gt;.'><sup>9</sup></a></span>, who demonstrates that a text drawn entirely at random from any given alphabet of symbols that includes a space will result in a text adhering to some form of Zipf-like distribution. We cannot hang our confidence on such a slender thread.</p>
<h1>Shifted Parameters</h1>
<p>While Zipf&#8217;s Law has been shown to hold for human language text, and a text that does not demonstrate it is certainly suspect, it is far from being the only telltale statistical property of natural language. We have already briefly examined sequences of repeated words in the text; we will now delve further.</p>
<p>Another curious distortion of human languages is that they demonstrate a preference for shorter words. The precise mechanism that results in this apparently universal property is unclear, but likely relates somehow to efficiency of communication. Regardless of the deeper causality, in most natural language texts there is a markedly higher frequency of short words than longer words.</p>
<p>As demonstrated by <a href="https://doi.org/10.1111/j.0039-3193.2004.00109.x">Sigurd, Eeg-Olofsson, and van Weijer, (2004)</a><span id='easy-footnote-10-945' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/#easy-footnote-bottom-10-945' title='Sigurd, Bengt, Mats Eeg-Olofsson, and Joost van Weijer. ‘Word Length, Sentence Length and Frequency &amp;#8211; Zipf Revisited’. Studia Linguistica 58, no. 1 (April 2004): 37–52. &lt;a href=&quot;https://doi.org/10.1111/j.0039-3193.2004.00109.x&quot;&gt;https://doi.org/10.1111/j.0039-3193.2004.00109.x&lt;/a&gt;.'><sup>10</sup></a></span>, however, the very shortest words are not the most common. Instead, at least for English, Swedish, and German, words between 3 and 5 letters in length dominate. This property can be accurately modelled by an appropriately parameterised <a href="https://en.wikipedia.org/wiki/Gamma_distribution">Gamma distribution</a>.</p>
<p>Notably, and conveniently, this property will not hold for random texts as described above. These purely stochastic texts would be expected to produce, in the long term, a monotonically-decreasing function as word length increases.</p>
<p>To demonstrate this effect, we can simulate a purely random text along the lines discussed by Li, and show its correspondingly naïve descending plot.</p>
<figure id="attachment_949" aria-describedby="caption-attachment-949" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/random_length_frequency_plot.png"><img loading="lazy" decoding="async" data-attachment-id="949" data-permalink="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/random_length_frequency_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/random_length_frequency_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Word-Length Frequency Plot of Randomly-Generated Text" data-image-description="&lt;p&gt;Word-Length Frequency Plot of Randomly-Generated Text&lt;/p&gt;
" data-image-caption="&lt;p&gt;Word-Length Frequency Plot of Randomly-Generated Text&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/random_length_frequency_plot.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/random_length_frequency_plot.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/10/random_length_frequency_plot.png" alt="Word-Length Frequency Plot of Randomly-Generated Text" width="1920" height="1080" class="size-full wp-image-949" /></a><figcaption id="caption-attachment-949" class="wp-caption-text">Word-Length Frequency Plot of Randomly-Generated Text | (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/random_length_frequency_plot.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Word-Length Frequency Plot of Randomly Generated Text</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>random_text_frequencies.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( ggplot2 )<br />
library( tidyverse )<br />
library( magrittr )</p>
<p>library( cowplot )</p>
<p>library( showtext )</p>
<p>library( tidytext )</p>
<p>showtext_auto()</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;unicode_font&quot;, &quot;/usr/share/fonts/TTF/Code2000.ttf&quot;)</p>
<p># Vector of symbols (letters) from which to sample.<br />
# Note that a blank space is included as a symbol.<br />
symbol_vector &lt;-<br />
	c(<br />
	  &quot;a&quot;,&quot;b&quot;,&quot;c&quot;,&quot;d&quot;,&quot;e&quot;,&quot;f&quot;,&quot;g&quot;,&quot;h&quot;,&quot;i&quot;,<br />
	  &quot;j&quot;,&quot;k&quot;,&quot;l&quot;,&quot;m&quot;,&quot;n&quot;,&quot;o&quot;,&quot;p&quot;,&quot;q&quot;,&quot;r&quot;,<br />
	  &quot;s&quot;,&quot;t&quot;,&quot;u&quot;,&quot;v&quot;,&quot;w&quot;,&quot;x&quot;,&quot;y&quot;,&quot;z&quot;,&quot; &quot; )</p>
<p># Create a corpus of the same number of symbols as our transcription<br />
# of the Voynich Manuscript<br />
random_corpus &lt;-<br />
	sample( symbol_vector, size = 16777216, replace = TRUE )  %&gt;%<br />
	paste( collapse=&quot;&quot; ) %&gt;%<br />
	enframe( name=NULL ) %&gt;%<br />
	unnest_tokens( word, value ) </p>
<p># Most common words<br />
message( &quot;Calculating word-length frequency statistics&#8230;&quot; )<br />
random_length_freq &lt;-<br />
	random_corpus %&gt;%<br />
	mutate( length = str_length( word ) ) %&gt;%<br />
	count( length ) %&gt;%<br />
	mutate( freq = n / sum(n) )</p>
<p># Identify the lengths of words individually as a vector for distribution fitting<br />
random_lengths &lt;-<br />
	random_corpus %&gt;%<br />
	mutate( length = str_length( word ) ) %&gt;%<br />
	extract2( &quot;length&quot; )</p>
<p># Plot length frequency curve.<br />
gp &lt;-<br />
	ggplot( random_length_freq, aes( x = length, y = freq ) ) +<br />
	geom_line( colour=&quot;#8a0707&quot; ) +<br />
	geom_text( colour=&quot;#8a0707&quot;, label=&quot;\u2720&quot;, family = &quot;unicode_font&quot;, size=4 ) +<br />
	labs(<br />
		  x = &quot;Word Length&quot;,<br />
		  y = &quot;Frequency&quot; )</p>
<p># Theme plot<br />
gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.title = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=22 ),<br />
			plot.subtitle = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.title.x = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.title.y = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.line = element_line( colour=&quot;#3c3f4a&quot; ),<br />
			panel.grid.major.x = element_blank(),<br />
			panel.grid.major.y = element_blank(),<br />
			panel.grid.minor.x = element_blank(),<br />
			panel.grid.minor.y = element_blank()<br />
			) </p>
<p># Remove legend from internal plot<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(<br />
				  &quot;Word Length Frequency Plot of Random Corpus&quot;,<br />
				  fontfamily=&quot;bold_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=20,<br />
				  hjust=0, vjust=1,<br />
				  x=0.02, y=0.88 ) +<br />
	draw_label(<br />
				  &quot;http://www.weirddatascience.net | @WeirdDataSci&quot;,<br />
				  fontfamily=&quot;main_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=12,<br />
				  hjust=0, vjust=1,<br />
				  x=0.02, y=0.40 )</p>
<p># Combine plots<br />
tgp &lt;-<br />
	plot_grid(<br />
				 title,<br />
				 gp,<br />
				 ncol=1,<br />
				 rel_heights=c( 0.1, 1 ) ) </p>
<p># Add parchment underlay<br />
parchment_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>save_plot(&quot;output/random_length_frequency_plot.pdf&quot;,<br />
							parchment_plot,<br />
							base_width = 16,<br />
							base_height = 9,<br />
			           	base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<p>As can be seen, the word-length frequency of this random text forms a comfortingly simple exponential curve, with pseudo-words of one letter being by far the most common. It is also notable that, at the far reaches of the probability distribution, this cacophonous experiment will produce words of almost four-hundred letters in length. While adherence to Zipf&#8217;s Law would have misled us into supporting this as an apparently natural language, even a cursory glance at this plot would have convinced us otherwise.</p>
<p>How, then, does the Voynich Manuscript adhere to the expected Gamma distribution of Sigurd et al.? We can employ the excellent <a href="https://cran.r-project.org/package=fitdistrplus">fitdistrplus</a> package to peel back this particular veil.</p>
<figure id="attachment_951" aria-describedby="caption-attachment-951" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_length_frequency_plot.png"><img loading="lazy" decoding="async" data-attachment-id="951" data-permalink="https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/voynich_length_frequency_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_length_frequency_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Voynich Word-Length Frequency with Fitted Gamma Distribution" data-image-description="&lt;p&gt;Voynich Word-Length Frequency with Fitted Gamma Distribution&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Word-Length Frequency with Fitted Gamma Distribution&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_length_frequency_plot.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_length_frequency_plot.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_length_frequency_plot.png" alt="Voynich Word-Length Frequency with Fitted Gamma Distribution" width="1920" height="1080" class="size-full wp-image-951" /></a><figcaption id="caption-attachment-951" class="wp-caption-text">Voynich Word-Length Frequency with Fitted Gamma Distribution | (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/10/voynich_length_frequency_plot.pdf">PDF Version</a>)</figcaption></figure>
<p>The word-length frequency distribution of the Voynich Manuscript clearly demonstrates a preference for four-letter words, not only breaking free of the confines of pure randomness, but also corresponding broadly with observed frequency patterns of the languages tested by Sigurd et al.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Fitting Gamma distribution to Voynich Manuscript corpus</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>voynich_gamma.r</code><br />
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggthemes )<br />
library( showtext )</p>
<p>library( tidytext )</p>
<p>library( fitdistrplus )</p>
<p>library(cowplot)<br />
library(magick)</p>
<p># References:<br />
# &lt;https://plus.maths.org/content/mystery-zipf&gt;<br />
# https://en.wikipedia.org/wiki/Gamma_distribution</p>
<p>font_add( &quot;voynich_font&quot;, &quot;/usr/share/fonts/TTF/weird/voynich/eva1.ttf&quot;)<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf&quot;)<br />
font_add( &quot;unicode_font&quot;, &quot;/usr/share/fonts/TTF/Code2000.ttf&quot;)</p>
<p>showtext_auto()</p>
<p>message( &quot;Reading raw Voynich data&#8230;&quot; )<br />
voynich_tbl &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Tokenize<br />
voynich_words &lt;-<br />
	voynich_tbl %&gt;%<br />
	unnest_tokens( word, text ) </p>
<p># Most common words<br />
message( &quot;Calculating Voynich language word-length frequency statistics&#8230;&quot; )<br />
voynich_length_freq &lt;-<br />
	voynich_words %&gt;%<br />
	mutate( length = str_length( word ) ) %&gt;%<br />
	count( length ) %&gt;%<br />
	mutate( freq = n / sum(n) )</p>
<p># Identify the lengths of words individually as a vector for distribution fitting<br />
voynich_lengths &lt;-<br />
	voynich_words %&gt;%<br />
	mutate( length = str_length( word ) ) %&gt;%<br />
	extract2( &quot;length&quot; )</p>
<p># Fit a gamma distribution to the observed frequencies</p>
<p># Extract frequency counts for lengths<br />
gamma_fit &lt;-<br />
	fitdist( voynich_lengths, &quot;gamma&quot; )</p>
<p># Plot length frequency curve.<br />
gp &lt;-<br />
	ggplot( voynich_length_freq, aes( x = length, y = freq ) ) +<br />
	geom_line( aes( colour=&quot;Voynich&quot; ) ) +<br />
	geom_text( colour=&quot;#8a0707&quot;, label=&quot;\u2720&quot;, family = &quot;unicode_font&quot;, size=6 ) +<br />
	labs(<br />
		  x = &quot;Word Length&quot;,<br />
		  y = &quot;Frequency&quot; )</p>
<p># Overlay fitted gamma distribution<br />
gp &lt;-<br />
	gp +<br />
	stat_function(<br />
					  fun=dgamma,<br />
					  args=list(<br />
									shape = gamma_fit$estimate[&quot;shape&quot;],<br />
									rate = gamma_fit$estimate[&quot;rate&quot;] ),<br />
					aes( colour = &quot;Fitted Gamma&quot; ) )</p>
<p># Label gamma line and original data.<br />
# (The &#8216;breaks&#8217; argument to scale_colour_manual reorders the lines manually,<br />
# otherwise they will be ordered alphabetically. This is useful for combined<br />
# plots where there isn&#8217;t a colour factor variable that can be reordered.)<br />
gp &lt;-<br />
	gp +<br />
	scale_colour_manual(<br />
							  values = c(&quot;Fitted Gamma&quot; = &quot;#0b6788&quot;, &quot;Voynich&quot; = &quot;#8a0707&quot; ),<br />
							  breaks = c(&quot;Voynich&quot;, &quot;Fitted Gamma&quot; )<br />
	)</p>
<p>gp &lt;-<br />
	gp +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.background = element_rect(fill = &quot;transparent&quot;, colour = &quot;transparent&quot;),<br />
			plot.title = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=22 ),<br />
			plot.subtitle = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.title.x = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.title.y = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=12 ),<br />
			axis.line = element_line( colour=&quot;#3c3f4a&quot; ),<br />
			legend.text = element_text( family=&quot;bold_font&quot;, colour=&quot;#3c3f4a&quot;, size=10 ),<br />
			legend.title = element_blank(),<br />
			legend.position=c(.85,.5),<br />
			legend.key.width = unit(1.6, &quot;lines&quot;),<br />
			legend.key.height = unit(1.2, &quot;lines&quot;),<br />
			panel.grid.major.x = element_blank(),<br />
			panel.grid.major.y = element_blank(),<br />
			panel.grid.minor.x = element_blank(),<br />
			panel.grid.minor.y = element_blank()<br />
			) </p>
<p># Remove legend from internal plot<br />
theme_set(theme_cowplot(font_size=4, font_family = &quot;main_font&quot; ) )  </p>
<p># Cowplot trick for ggtitle<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(<br />
				  &quot;Word Length Frequency Plot of Voynich Corpus&quot;,<br />
				  fontfamily=&quot;bold_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=20,<br />
				  hjust=0, vjust=1,<br />
				  x=0.02, y=0.88 ) +<br />
	draw_label(<br />
				  &quot;http://www.weirddatascience.net | @WeirdDataSci&quot;,<br />
				  fontfamily=&quot;main_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=12,<br />
				  hjust=0, vjust=1,<br />
				  x=0.02, y=0.40 )</p>
<p>data_label &lt;-<br />
	ggdraw() +<br />
	draw_label(<br />
				  &quot;Data: http://www.voynich.nu&quot;,<br />
				  fontfamily=&quot;main_font&quot;,<br />
				  colour = &quot;#3c3f4a&quot;,<br />
				  size=14, hjust=1,<br />
				  x=0.98 ) </p>
<p># Combine plots<br />
tgp &lt;-<br />
	plot_grid(<br />
				 title,<br />
				 gp,<br />
				 data_label,<br />
				 ncol=1,<br />
				 rel_heights=c( 0.1, 1, 0.1 ) ) </p>
<p># Add parchment underlay<br />
parchment_plot &lt;-<br />
	ggdraw() +<br />
	draw_image(&quot;img/parchment.jpg&quot;, scale=1.4 ) +<br />
	draw_plot(tgp)</p>
<p>save_plot(&quot;output/voynich_length_frequency_plot.pdf&quot;,<br />
							parchment_plot,<br />
							base_width = 16,<br />
							base_height = 9,<br />
			           	base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<h1>Twisting Paths</h1>
<p>These analyses can only present a dim outline of the text itself, and we resist the awful temptation to attempt any form of decipherment. Certainly, the evidence here seems convincing enough that the Voynich Manuscript does represent a human language, but the statistics presented here are of little use in such an effort. It is likely, of course, that the most frequent words in the manuscript may, under certain assumptions, correspond to the most common words or particles in many languages &#8212; the definite article, the indefinite article, conjunctions, pronouns, and similar. Without deeper knowledge of the language, however, and with the range of scribing conventions and shortcuts commonplace in texts of the period, these techniques are too limited to do more than tantalise us with what we may never know.</p>
<h1>Credible Conclusions</h1>
<p>Subjecting the text of the Voynich Manuscript to the crude frequency analyses presented here can support, although not prove, the view that the manuscript, regardless of its true content, is not simply random gibberish. Nor is the text likely to be the result of a simple mechanical process designed without knowledge of the statistical patterns of human languages. Neither is it likely to be any form of cryptogram more sophisticated than the simplest ciphers, as these would have tended to compromise the statistical properties that we have observed.</p>
<p>The demonstrable following of Zipf&#8217;s Law, and the adherence to a Gamma distribution of similar shape to known languages, strongly suggests that the text is likely a representation of some natural language.</p>
<p>In the next post we will attempt blindly to wrench more secrets from the text itself through application of modern textual analysis techniques. Until then the Voynich Manuscript remains, silently obscure, beyond the reach of our faltering science.</p>
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/10/28/illuminating-the-illuminated-part-two-ipsa-scientia-potestas-est/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">945</post-id>	</item>
		<item>
		<title>Illuminating the Illuminated Part One: A First Look at the Voynich Manuscript</title>
		<link>https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/</link>
					<comments>https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Thu, 26 Sep 2019 20:15:16 +0000</pubDate>
				<category><![CDATA[bibliophilia]]></category>
		<category><![CDATA[cryptology]]></category>
		<category><![CDATA[linguistics]]></category>
		<guid isPermaLink="false">http://www.weirddatascience.net/?p=883</guid>

					<description><![CDATA[<div class="mh-excerpt">While the world abounds with strange phenomena ripe for analysis in their raw state, there is a peculiar pleasure in scrutinising arcane information curated and obscured by the human mind.

The Voynich Manuscript is one of the most well-known and studied volumes of occult knowledge. The book's most recent history involves its purchase in 1912 by Wilfrid Voynich, a rare book dealer, from a sale of manuscripts by the Society of Jesus at the Villa Mondragone, Frascati.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/" title="Illuminating the Illuminated Part One: A First Look at the Voynich Manuscript">[...]</a>]]></description>
										<content:encoded><![CDATA[<h1>The Voynich Manuscript</h1>
<p>While the world abounds with strange phenomena ripe for analysis in their raw state, there is a peculiar pleasure in scrutinising arcane information curated and obscured by the human mind.</p>
<p>The Voynich Manuscript is one of the most well-known and studied volumes of occult knowledge. The book&#8217;s most recent history involves its purchase in 1912 by Wilfrid Voynich, a rare book dealer, from a sale of manuscripts by the Society of Jesus at the Villa Mondragone, Frascati. Following several fruitless years of attempts to decipher the manusript and discover its origin, or to interest others in it, Wilfrid Voynich died. The book passed through a number of other hands before being donated to Yale University by the noted rare book dealer Hans P. Kraus in 1969. It now resides in Yale&#8217;s <a href="https://beinecke.library.yale.edu/">Beinecke Rare Book and Manuscript Library</a> with the designation <a href="https://brbl-dl.library.yale.edu/vufind/Record/3519597">MS 408</a>.</p>
<p>Written almost entirely in an unknown script, barring a small number of words apparently in Latin and High German, the manuscript is compellingly illustrated with depictions of plants, herbs, human figures, astronomical and astrological symbols. The manuscript has resisted all attempts at interpretation by cryptographers, historians, and linguists.</p>
<figure id="attachment_910" aria-describedby="caption-attachment-910" style="width: 7486px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg"><img loading="lazy" decoding="async" data-attachment-id="910" data-permalink="https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/voynich_folio_178/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" data-orig-size="7486,3715" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="voynich_folio_178" data-image-description="&lt;p&gt;Voynich Manuscript &amp;#8211; Folio 178&lt;/p&gt;
" data-image-caption="&lt;p&gt;Voynich Manuscript &amp;#8211; Folio 178&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" src="http://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_folio_178.jpg" alt="Voynich Manuscript Folio 178" width="7486" height="3715" class="size-full wp-image-910" /></a><figcaption id="caption-attachment-910" class="wp-caption-text">Voynich Manuscript &#8211; Folio 178</figcaption></figure>
<p>From a linguistic and cryptographic perspective, this lack of success in interpretation is not surprising. The two-hundred or so folios of the manuscript, while beautifully illuminated, present a sadly limited corpus of text for the purposes of traditional analysis.</p>
<p>In this short series of posts we will subject the Voynich Manuscript to a range of text analysis techniques, delving into its structure, gain horrific insight into its composition, and skeptically assessing its credibility. The manuscript has been subjected to almost fifty years of furtive attempts by cryptographers, including the <a href="https://apps.dtic.mil/dtic/tr/fulltext/u2/a070618.pdf">US National Security Agency</a> and a menagerie of others from the distinguished to the deranged. We will crudely mimic some earlier results, and hopefully add our own confusion to the roiling mass of current research into the Voynich Manuscript.</p>
<h1>Authenticity</h1>
<p>Since its discovery, and throughout the ongoing unsuccessful attempts to decipher its contents, many have questioned the authenticity of the Voynich Manuscript. The theory that the entire book is a hoax, either by contemporary scribes or by more modern players, has been raised repeatedly over the years.</p>
<p>Radiocarbon dating in 2010 <a href="https://uanews.arizona.edu/story/ua-experts-determine-age-of-book-nobody-can-read">asserted that the manuscript&#8217;s parchment likely dates from the early 15th century</a>; the volume of parchment in the manuscript, and its consistency across the document, make it unlikely, although not impossible, that the book is a modern-day hoax.</p>
<p>Other supporting evidence has drawn from early mentions of the manuscript in correspondence. According to <a href="http://www.voynich.nu/index.html">http://www.voynich.nu</a>, which presents a far more detailed and thorough description of the research around the manuscript and its history than we could hope to offer here, the first extant mention of the manuscript can be found in a 1639 <a href="http://www.weirddatascience.net/wp-content/uploads/2019/09/voynich_letter_39a.jpg" alt="1639 letter from Athanasius Kircher in Rome, replying to a letter forwarded from Georgius Barschius of Prague by the mathematician Theodor Moretus.">letter</a> from <a href="https://en.wikipedia.org/wiki/Athanasius_Kircher">Athanasius Kircher</a> in Rome, replying to a letter forwarded from <a href="https://en.wikipedia.org/wiki/Georg_Baresch">Georgius Barschius</a> of Prague by the mathematician <a href="https://en.wikipedia.org/wiki/Theodorus_Moretus">Theodor Moretus</a>.</p>
<p>The letter refers to a <a href="http://www.voynich.nu/letters.html">&#8220;book of mysterious steganography&#8221;</a> (<em>&#8220;libellum&#8230; &#8230;steganographici mysterisi&#8221;</em>) illustrated with pictures of plants, stars and chemical secrets that Kirscher had not yet had time to decipher. Barschius had sought out Kirscher&#8217;s expertise due to his fame at the time for claiming to have, erroneously as it later transpired, deciphered the hieroglyphic writing system of the Ancient Egyptian language. Later correspondence between Barschius and Kirscher appears, according to Zandbergen<span id='easy-footnote-11-883' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#easy-footnote-bottom-11-883' title='&lt;a href=&quot;http://www.voynich.nu/letters.html&quot;&gt;http://www.voynich.nu/letters.html&lt;/a&gt;'><sup>11</sup></a></span>, to suggest strongly that the mysterious book in question is the Voynich Manuscript based on its description.</p>
<h1>A Statistical Argument: Zipf&#8217;s Law</h1>
<p>We now turn from historical sources to darker, more statistical realms. There is compelling support for the notion that, regardless of the true meaning of the book, its contents are drawn from a human language and are neither random symbols nor any form of sophisticated cipher.</p>
<p>One of the pillars of this argument is that certain statistical properties of the Voynich Manuscripts text strongly resemble those of natural, human languages, and which are unlikely, although not impossible, to arise from random text, artificially generated text, or most forms of encipherment.</p>
<p>The most well-known of these statistical properties is the apparent adherence of the manuscript to <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf&#8217;s Law</a>. This law, <a href="https://mitpress.mit.edu/books/psycho-biology-language">made famous by</a> the US linguist <a href="https://en.wikipedia.org/wiki/George_Kingsley_Zipf">George Zipf</a>, observes that in corpora of natural languages, the frequency of a word is inversely proportional to its rank when words from a corpus are ordered by frequency. More plainly: the most common word in a language is likely to be <em>n</em> times more common than the second most common word; the second word will be roughly <em>n</em> times more common than the third word, and so on. Whilst merely an approximation, this law can be seen to hold for most human languages, and for a range of other natural phenomena.</p>
<p>Random gibberish, on the other hand, would most likely not follow Zipf&#8217;s Law, although carefully crafted gibberish certainly could. <a href="https://www.tandfonline.com/doi/abs/10.1080/0161-110491892755">Rugg</a> has demonstrated that a simple mechanical procedure can produce randomised text that adhered to Zipf&#8217;s Law, although the example he provides is both somewhat contrived and also presupposes a knowledge of this statistical quirk of human languages in the first place. Given that the physical makeup of the Voynich Manuscript dates to the early 15th Century, some four centuries before Zipf popularised this mathematical assessment of human languages, the argument that it is a contemporary act of calligraphic glossolalia seems strained.</p>
<p>Similarly, most forms of cryptography beyond the simplest <a href="https://en.wikipedia.org/wiki/Substitution_cipher">substitution ciphers</a> would also skew the text away from Zipf&#8217;s Law. It is notable that the Voynich Manuscript predates even works such as <a href="https://en.wikipedia.org/wiki/Johannes_Trithemius">Trithemius</a>&#8216;s <a href="https://en.wikipedia.org/wiki/Steganographia">Steganographia</a>, or the <a href="https://en.wikipedia.org/wiki/Book_of_Soyga">Book of Soyga</a> and its <a href="https://link.springer.com/chapter/10.1007/1-4020-4246-9_10">magic tables</a> of letters that so obsessed <a href="https://en.wikipedia.org/wiki/John_Dee">John Dee</a>.</p>
<p>In contrast, however, it has been claimed that other features of the text raise doubts. One of the most commonly stated counter-arguments to the natural hypothesis of the Voynich text is that some words are repeated an unnatural number of times. Depending on the transcription, individual words have been reported to be repeated up to five times. Whilst this is not an impossible occurrence in human lanugage, it is highly irregular.</p>
<p>The next post in this short series will focus on the Voynich Manuscript&#8217;s adherence, or lack thereof, to Zipf&#8217;s Law in full. Following that, we will see the extent to which other forms of modern textual analysis can be applied to dissect the arcane and unrelenting secrets of MS 408.</p>
<p>This post, however, will describe the contortions required to render the Voynich text suitable for our particular form of scrutiny.</p>
<h1>Assumptions</h1>
<p>Given the format and presentation of the text, we make several assumptions about the writing system contained in the Voynich Manuscript:</p>
<ul>
<li>It is written in an alphabet, or potentially an <a href="https://en.wikipedia.org/wiki/Abjad">abjad</a> or even an <a href="https://en.wikipedia.org/wiki/Abugida">abugida</a><span id='easy-footnote-12-883' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#easy-footnote-bottom-12-883' title='Indeed, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Ge%CA%BDez&quot;&gt;Ge&amp;#8217;ez language&lt;/a&gt;, from which the term &lt;em&gt;abugida&lt;/em&gt; was derived, has at various times been proposed as a candidate for the source language of the Voynich Manuscript'><sup>12</sup></a></span>, and not a <a href="https://en.wikipedia.org/wiki/Logogram">logographic system</a>. That the text is not logographic is justified by the small number of individual symbols. The distinction between the other systems is sufficiently subtle that it will not affect our analyses<span id='easy-footnote-13-883' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/#easy-footnote-bottom-13-883' title='In fact, most of the analyses we perform would also function in a logographic system.'><sup>13</sup></a></span>.</li>
<li>The manuscript is written from left to right, and not the reverse, vertically, <a href="https://en.wikipedia.org/wiki/Boustrophedon">boustrophedon</a>. This is uncontroversial and apparent from even a cursory inspection of the text itself; the horizontal flow of the writing is clear, with lines clearly starting at the left margin and ending before the right. The text is separated into paragraphs, of which the final line is justified to the left.</li>
</ul>
<h1>Data</h1>
<p>Due to the diligent activity of several generations of Voynich researchers, the text of the manuscript has been transcribed into a machine-readable format. As the alphabet is unknown, there are minor uncertainties in rendering the text, leading to a number of similar but competing transcriptions. The subtle details of the various transcription efforts, and their history, are available at: <a href="http://www.voynich.nu/transcr.html">http://www.voynich.nu/transcr.html</a>, with the raw data available at <a href="http://www.voynich.nu/data/">http://www.voynich.nu/data/</a>. We have settled on the <a href="http://www.voynich.nu/transcr.html#v101">v101</a> transliteration by Glen Claston, rendered in the <a href="https://www.voynich.nu/data/IVTFF_format.pdf">Intermediate Voynich Transliteration File Format</a> (IVTFF) of Zandbergen. This is one of the more recent and widely-used transcriptions, and has the added advantage of being supported by the availability of a <a href="http://www.voynich.nu/roadmap.html#fonts">TrueType font</a>. The underlying file is available here: <a href="http://www.voynich.nu/data/GC_ivtff_s.txt"> http://www.voynich.nu/data/GC_ivtff_s.txt</a>.</p>
<h1>Crude Manipulations</h1>
<p>We perform the following steps to make the data usable for our analyses. For many scenarios, we would develop a generalisable set of steps to allow conversion of many documents to an appropriate form. Until and unless, however, a new cache of documents in the same language are found, it is simpler and easier to perform these one-time steps manually.</p>
<p>Firstly, we delete from the text all incomplete words, as marked in the IVTFF format. This includes:</p>
<ul>
<li>all text in angle brackets</li>
<li>all words containing ?&#8217;s</li>
<li>all words containing []</li>
</ul>
<p>Secondly, we tokenize the text and remove punctuation. The transcription of the Voynich manuscript that we have chosen uses the following punctuation:</p>
<ul>
<li>&#8220;.&#8221; is a space</li>
<li>&#8220;,&#8221; is a potential space. For simplicity, we do not treat these as a space.</li>
</ul>
<p>Finally, we organize the document in an appropriate form to be imported into an R data frame, or tidyverse tibble.</p>
<p>The above steps were performed in the Vim text editor, and the commands used are reproduced in the code below:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show Vim text manipulation commands.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
[code]
<p># Delete all commented lines<br />
:\%g/\^\#.\*/d</p>
<p># Remove blank lines<br />
:\%g/\^\$/d</p>
<p># Remove &quot;,&quot; &#8212; assume that potential spaces are /not/ spaces.<br />
:\%s/,//g</p>
<p># Replace each folio&#8217;s page marker (initial for each page)<br />
# with its contents, followed by a comma. (&lt;f1r&gt; -\&gt; f1r,)<br />
:\%s/\^&lt;fRos&gt;\s*\&lt;.{-}\&gt;\$/\rfRos,/<br />
:\%s/\^\&lt;(f\[0-9\]*\[r\|v\]\[0-9\]*)\&gt;\s*\&lt;.{-}\&gt;\$/\r\1,/</p>
<p># Remove all \&lt;\&gt; entries (non-greedy)<br />
:\%s/\&lt;.{-}\&gt;\s*//g</p>
<p># Join all paragraphs (all newlines followed by a character<br />
# other than a newline are removed).<br />
:\%s/\n(\[\^\\n\])/.\\1/<br />
:\%s/\^.f/f/<br />
:\%s/,./,/</p>
<p># Replace &quot;high ascii&quot; rare characters from the IVTFF with their<br />
# ASCII representation. (&lt;http://www.voynich.nu/img/extra/v101a.jpg&gt;)<br />
:\%s/@(\[0-9\]{-});/=nr2char(submatch(1))/g</p>
<p># Replace full stops with spaces<br />
:\%s/./ /g</p>
[/code]
</div></div>
</div>
<p>The resulting raw data file is available <a href="http://www.weirddatascience.net/wp-content/uploads/2019/09/GC_ivtff_s-processed.txt">here</a>. This file can be read into R simply by use of the <code>read.csv</code> function:</p>
<pre class="brush: r; title: ; notranslate">
voynich_tbl &lt;- 
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%
	rename( folio = X1, text = X2 )
</pre>
<p>As a first, horrifying glance into the forms of analysis that this allows, we can now use our raw data to identify the most repeated words in the manuscript, according to our transcription. The following R code extracts the entirety of the text and encodes it as a <a href="https://en.wikipedia.org/wiki/Run-length_encoding">run length encoding</a>. This conveniently results in a sequential list of words and the number of times that each is repeated <em>in sequence</em>. We can then simply extract the largest number of repetitions for each word in the corpus:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Count longest word repetition sequences in the Voynich Manuscript.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
[code language=&#8221;r&#8221;]
<p>library( tidyverse )<br />
library( magrittr )</p>
<p># Count the number of repeated words in the Voynich Manuscript text.</p>
<p># Load the raw data<br />
voynich_tbl &lt;-<br />
	read_csv( &quot;data/voynich_raw.txt&quot;, col_names=FALSE ) %&gt;%<br />
	rename( folio = X1, text = X2 )</p>
<p># Extract the text as a vector of words<br />
voynich_vector &lt;-<br />
	voynich_tbl %&gt;%<br />
	extract2( &quot;text&quot; ) %&gt;%<br />
	paste( sep=&quot; &quot;, collapse=&quot; &quot; ) %&gt;%<br />
	str_split( &quot; &quot; ) %&gt;%<br />
	unlist</p>
<p># Create a run length encoding object from the vector<br />
voynich_rle &lt;-<br />
	voynich_vector %&gt;%<br />
	rle</p>
<p># Convert rle object to a data frame and report the maximum number of repeated<br />
# cases for each word<br />
voynich_repetitions &lt;-<br />
	voynich_rle %&gt;%<br />
	unclass %&gt;%<br />
	as.data.frame %&gt;%<br />
	group_by( values ) %&gt;%<br />
	summarise( max_repetitions = max( lengths ) ) %&gt;%<br />
	ungroup %&gt;%<br />
	arrange( desc( max_repetitions ) )</p>
[/code]
</div></div>
</div>
<p>This simple analysis shows that, in the transcription we have chosen, the longest sequences of repeated words are only three words in length, occuring a total of five times in the text. While there are many other arguments against the potential validity of the Voynich Manuscript, word repetition does in itself present a compelling reason to doubt that the text is a human language.</p>
<p>We have now reduced the strange and beautiful elegance of the Voynich Manuscript&#8217;s centuries-old illuminations to a crude, utilitarian abstraction. With this particular act of artistic and literary desecration complete, in the next post we will examine Zipf&#8217;s Law in more detail, and interrogate the extent to which this law supports or undermines the text&#8217;s authenticity.</p>
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/09/26/illuminating-the-illuminated-a-first-look-at-the-voynich-manuscript/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">883</post-id>	</item>
		<item>
		<title>Bayes vs. the Invaders! Part Four: Convergence</title>
		<link>https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/</link>
					<comments>https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/#respond</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Sun, 28 Apr 2019 08:38:43 +0000</pubDate>
				<category><![CDATA[beyond the veil]]></category>
		<category><![CDATA[stan]]></category>
		<category><![CDATA[ufo]]></category>
		<guid isPermaLink="false">http://www.weirddatascience.net/?p=728</guid>

					<description><![CDATA[<div class="mh-excerpt">In the previous three posts in our series delving into the cosmic horror of UFO sightings in the United States, we have descended from the deceptively warm and sunlit waters of basic linear regression, through the increasingly frigid, stygian depths of Bayesian inference, generalised linear models, and the probabilistic programming language Stan. In this final post we will explore the implications of the murky realms in which we find ourselves, and consider the awful choices that have led us to this point. We will therefore look, with merciful brevity, at the foul truth revealed by our models, but also consider the arcane philosophies that lie sleeping beneath.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/" title="Bayes vs. the Invaders! Part Four: Convergence">[...]</a>]]></description>
										<content:encoded><![CDATA[<h1>Sealed and Buried</h1>
<p>In the previous three posts<span id='easy-footnote-14-728' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/#easy-footnote-bottom-14-728' title='To immerse yourself in the full horror of our journey, the previous posts in this series are here: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.weirddatascience.net/index.php/2019/04/03/bayes-vs-the-invaders-part-one-the-37th-parallel/&quot;&gt;Bayes vs. the Invaders! Part One: The 37th Parallel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.weirddatascience.net/index.php/2019/04/08/bayes-vs-the-invaders-part-two-abnormal-distributions/&quot;&gt;Bayes vs. the Invaders! Part Two: Abnormal Distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.weirddatascience.net/index.php/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/&quot;&gt;Bayes vs. the Invaders! Part Three: The Parallax View&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
'><sup>14</sup></a></span> in our series delving into the cosmic horror of UFO sightings in the US, we have descended from the deceptively warm and sunlit waters of basic linear regression, through the increasingly frigid, stygian depths of Bayesian inference, generalised linear models, and the probabilistic programming language Stan.</p>
<p>In this final post we will explore the implications of the murky realms in which we find ourselves, and consider the awful choices that have led us to this point. We will therefore look, with merciful brevity, at the foul truth revealed by our models, but also consider the arcane philosophies that lie sleeping beneath.</p>
<h1>Deviant Interpretations</h1>
<p>Our crazed wanderings through dark statistical realms have led us eventually to a <a href="https://www.weirddatascience.net/index.php/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/">varying slope, varying intercept negative binomial generalised linear model</a>, whose selection was justified over its simpler cousins via leave-one-out cross-validation (LOO-CV). By interrogating the range of hyperparameters of this model, we could reproduce an alluringly satisfying visual display of the posterior predictive distribution across the United States:</p>
<figure id="attachment_703" aria-describedby="caption-attachment-703" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png"><img loading="lazy" decoding="async" data-attachment-id="703" data-permalink="https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/predictive_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="bvi03-predictive_plot" data-image-description="" data-image-caption="&lt;p&gt;Varying intercept and slope negative binomial GLM of UFO sightings against population.&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-1024x576.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png" alt="" width="1920" height="1080" class="size-full wp-image-703" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-64x36.png 64w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-703" class="wp-caption-text">Varying intercept and slope negative binomial GLM of UFO sightings against population. (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.pdf">PDF Version</a>)</figcaption></figure>
<p>Further, our model provides us with insight into the individual per-state intercept &#92;(\alpha&#92;) and slope &#92;(\beta&#92;) parameters of the underlying linear model, demonstrating that there is variation in the rate of sightings between US states that cannot be accounted for by their ostensibly human populations.</p>
<figure id="attachment_705" aria-describedby="caption-attachment-705" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png"><img loading="lazy" decoding="async" data-attachment-id="705" data-permalink="https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/ufo_per-state_intercepts-slopes/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ufo_per-state_intercepts-slopes" data-image-description="" data-image-caption="&lt;p&gt;Varying slope and intercept negative binomial GLM parameter plot.&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-1024x576.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png" alt="" width="1920" height="1080" class="size-full wp-image-705" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-64x36.png 64w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-705" class="wp-caption-text">Varying slope and intercept negative binomial GLM parameter plot for UFO sightings model. (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.pdf">PDF Version</a>)</figcaption></figure>
<p>Interpreting these parameters, however, is not quite as simple as in a basic linear model<span id='easy-footnote-15-728' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/#easy-footnote-bottom-15-728' title='One advantage of generalised linear models is that they allow flexibility whilst remaining relatively interpretable. As models become more complex, extending generalised linear models (GLMs) through generalised &lt;em&gt;additive&lt;/em&gt; models (GAMs), and further into the vast chaotic mass of machine learning approaches such as random forests and convolutional neural networks, there is a general tradeoff between the terrifying power of the modelling technique and the ability of the human mind to &lt;a href=&quot;TODO:call_of_cthulhu&quot;&gt;correlate its contents&lt;/a&gt;.'><sup>15</sup></a></span>. Most importantly, our negative binomial GLM employs a <em>log link</em> function to relate the linear model to the data:</p>

$$\begin{eqnarray}<br />
y &amp;\sim&amp; \mathbf{NegBinomial}(\mu, \phi)&#92;&#92;<br />
\log(\mu) &amp;=&amp; \alpha + \beta x&#92;&#92;<br />
\alpha &amp;\sim&amp; \mathcal{N}(0, 1)&#92;&#92;<br />
\beta &amp;\sim&amp; \mathcal{N}(0, 1)&#92;&#92;<br />
\phi &amp;\sim&amp; \mathbf{HalfCauchy}(2)<br />
\end{eqnarray}$$</p>
<p>In a basic linear regression, &#92;(y=\alpha+\beta x&#92;), the &#92;(\alpha&#92;) parameter can be interpreted as the value of &#92;(y&#92;) when &#92;(x&#92;) is 0. Increasing the value of &#92;(x&#92;) by 1 results in a change in the &#92;(y&#92;) value of &#92;(\beta&#92;). We have, however, been drawn far beyond such naive certainties.</p>
<p>The &#92;(\alpha&#92;) and &#92;(\beta&#92;) coefficients in our negative binomial GLM produce the &#92;(\log&#92;) of the &#92;(y&#92;) value: the <em>mean</em> of the negative binomial in our parameterisation.</p>
<p>With a simple rearrangement, we can begin to understand the grim effects of this transformation:</p>
<p>$$\begin{array}{rrcl}<br />
 &amp; \log(\mu) &amp;=&amp; \alpha + \beta x&#92;&#92;<br />
\Rightarrow &amp; \mu &amp;=&amp; \operatorname{e}^{\alpha + \beta x}<br />
\end{array}$$</p>
<p>If we set &#92;(x=0&#92;):</p>
<p>$$\begin{eqnarray}<br />
\mu_0 &amp;=&amp; \operatorname{e}^{\alpha}<br />
\end{eqnarray}$$</p>
<p>The mean of the negative binomial when &#92;(x&#92;) is 0 is therefore &#92;(\operatorname{e}^{\alpha}&#92;). If we increase the value of &#92;(x&#92;) by 1:</p>
<p>$$\begin{eqnarray}<br />
\mu_1 &amp;=&amp; \operatorname{e}^{\alpha + \beta}&#92;&#92;<br />
&amp;=&amp; \operatorname{e}^{\alpha} \operatorname{e}^{\beta}<br />
\end{eqnarray}$$</p>
<p>Recalling the definition of the underlying mean of our model&#8217;s negative binomial, &#92;(\mu_0&#92;), above, this is simply:<br />
$$\mu_0 \operatorname{e}^{\beta}$$</p>
<p>The effect of an increase in &#92;(x&#92;) is therefore <em>multiplicative</em> with a log link: each increase of &#92;(x&#92;) by 1 causes the mean of the negative binomial to be further multiplied by &#92;(\operatorname{e}^{\beta}&#92;).</p>
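<p>To drag this abstraction into the light, a trivial sketch with purely hypothetical coefficient values:</p>
[code language=&#8221;R&#8221;]
# Hypothetical coefficients, purely for illustration.<br />
alpha &lt;- 1.2<br />
beta  &lt;- 0.4<br />
<br />
mu_0 &lt;- exp( alpha )          # Mean when x = 0<br />
mu_1 &lt;- exp( alpha + beta )   # Mean when x = 1<br />
<br />
mu_1 / mu_0                   # Identical to exp( beta ): the multiplicative effect
[/code]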
<p>Despite this insidious complexity, in many senses our naive interpretation of these values still holds true. A higher value for the &#92;(\beta&#92;) coefficient does mean that the rate of sightings increases more swiftly with population.</p>
<p>With the full, unsettling panoply of US States laid out before us, any attempt to elucidate their many and varied deviations would be overwhelming. Broadly, we can see that both slopes and intercepts are generally restricted to fairly close ranges, with the 50% and 95% credible intervals notably overlapping in many cases. Despite this, there are certain unavoidable abnormalities from which we cannot, must not, shrink:</p>
<ul>
<li>Only Pennsylvania presents a slope (\(\beta\)) parameter whose 95% credible interval admits the possibility of zero. The correlation between population and number of sightings is otherwise unambiguously positive.</li>
<li>Delaware, whilst presenting a wide credible interval for its slope (\(\beta\)) parameter, stands out as suffering from the greatest rate of change in sightings as its population increases.</li>
<li>Both California and Utah present suspiciously narrow credible intervals on their slope (\(\beta\)) parameters. The growth in sightings as the population increases therefore demonstrates a worrying consistency, although in both cases this rate is amongst the lowest of all the states.</li>
</ul>
<p>We can conclude, then, that while the <em>total</em> number of sightings in Delaware is currently low, any increase in numbers of residents there appears to possess a strange fascination for visitors from beyond the inky blackness of space. By contrast, whilst our alien observers have devoted significant resources to monitoring Utah and California, their apparent willingness to devote further effort to tracking those states&#8217; burgeoning populations is low.</p>
<h1>Trembling Uncertainty</h1>
<p>One of the fundamental elements of the Bayesian approach is its willing embrace of uncertainty. The outputs of our eldritch inferential processes are not <em>point estimates</em> of the outcome, as in certain other approaches, but instead <em>posterior predictive distributions</em> for those outcomes. As such, when we turn our minds to predicting new outcomes based on previously unseen data, our result is a <em>distribution</em> over possible values rather than a single estimate. Thus, at the dark heart of Bayesian inference is a belief in the truth that all uncertainty can be quantified as probability distributions.</p>
<p>The Bayesian approach as inculcated here has a <em>predictive</em> bent to it. These intricate methods lend themselves to forecasting a distribution of possibilities before the future unveils itself. Here, we gain a horrifying glimpse into the emerging occurrence of alien visitations to the US as its people <a href="https://www.goodreads.com/work/quotes/3194841-the-war-of-the-worlds">busy themselves about their various concerns, scrutinised and studied, perhaps almost as narrowly as a man with a microscope might scrutinise the transient creatures that swarm and multiply in a drop of water</a>.</p>
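<p>As a hedged illustration of this predictive machinery, and assuming the fitted model object and parameter names from the Stan code of the previous post, we might conjure a forecast distribution for a hypothetical new population value from the global parameters alone:</p>
[code language=&#8221;R&#8221;]
library( rstan )<br />
<br />
# Posterior draws of the global parameters of the final model.<br />
# (Object and parameter names follow the Stan code of the previous post.)<br />
draws &lt;- extract( fit_ufo_pop_negbinom_var_intercept_slope,<br />
                  pars = c( &quot;pop_intercept&quot;, &quot;pop_slope&quot;, &quot;phi&quot; ) )<br />
<br />
# A hypothetical new population value, on the centred and scaled<br />
# scale used throughout the models.<br />
new_population &lt;- 1.5<br />
eta &lt;- draws$pop_intercept + draws$pop_slope * new_population<br />
<br />
# The (mu, size) parameterisation of rnbinom matches neg_binomial_2( mu, phi ).<br />
predicted &lt;- rnbinom( length( eta ), mu = exp( eta ), size = draws$phi )<br />
quantile( predicted, probs = c( 0.05, 0.5, 0.95 ) )   # A distribution, not a point
[/code]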
<h1>Unavoidable Choices</h1>
<p>The twisted purpose underlying this series of posts has been not only to indoctrinate others into the hideous formalities of Bayesian inference, probabilistic programming, and the arcane subtleties of the <a href="https://www.mc-stan.org">Stan</a> programming language; but also to expose our own minds to their horrors. As such, there is a tentative method to the madness of some of the choices made in this series that we will now elucidate.</p>
<p>Perhaps the most jarring choice has been to code these models in Stan directly, rather than using one of the excellent helper libraries that allow for more concise generation of the underlying Stan code. Both <a href="https://cran.r-project.org/package=brms"><code>brms</code></a> and <a href="https://cran.r-project.org/package=rstanarm"><code>rstanarm</code></a> possess the capacity to spawn models such as ours with greater simplicity of specification and efficiency of output, due to a number of arcane tricks. As an exercise in internalising such forbidden knowledge, however, it is useful to address reality unshielded by such swaddling conveniences.</p>
<p>In fabricating models for more practical reasons, however, we would naturally turn to these tools unless our unspeakable demands go beyond their natural scope. As a personal choice, <code>brms</code> is appealing due to its more natural construction of readable per-model Stan code to be compiled. This allows for the grotesque internals of generated models to be inspected and, if required, twisted to whatever form we desire. <code>rstanarm</code>, by contrast, avoids per-model compilation by pre-compiling more generically applicable models, but its underlying Stan code is correspondingly more arcane for an unskilled neophyte.</p>
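<p>For the morbidly curious, a minimal sketch, untested in these pages, of how our final varying slope, varying intercept negative binomial model might be summoned via <code>brms</code>; the formula uses <code>lme4</code>-style grouping syntax, and the column names are those of our <code>ufo_population_sightings</code> data:</p>
[code language=&#8221;R&#8221;]
library( brms )<br />
<br />
# A varying intercept and slope negative binomial GLM, grouped by state.<br />
fit &lt;- brm( count ~ population + ( 1 + population | state ),<br />
            data   = ufo_population_sightings,<br />
            family = negbinomial() )<br />
<br />
# The generated Stan code can be inspected, and twisted, at will:<br />
stancode( fit )
[/code]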
<p>The Stan models presented in previous posts have also been constructed as simply as possible and have avoided all but the most universally accepted tricks for improving speed and stability<span id='easy-footnote-16-728' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/#easy-footnote-bottom-16-728' title='The most important of these are to center and scale our predictors, and to operate on the log scale through use of the &lt;code&gt;log&lt;/code&gt; form of the probability distribution sampling statements. Both of these contribute significantly to speed and numerical stability in sampling, and are sufficiently universal that their inclusion seemed justified.'><sup>16</sup></a></span>. Most notably, Stan presents specific functions for GLMs based on the Poisson and negative binomial distributions that apply standard link functions directly. As mentioned, we consider it more useful for personal and public indoctrination to use the basic, albeit <code>log</code>-form parameterisations.</p>
<h1>Last Rites</h1>
<p>In concluding the dark descent of this series of posts on Bayesian inference, generalised linear models, and the unearthly effects of extraterrestrial visitations on humanity, we have applied numerous esoteric techniques to identify, describe, and quantify the relationship between human population and UFO sightings. The enigmatic model constructed throughout this and the previous three entries darkly implies that, while the rate of inexplicable aerial phenomena is inextricably and positively linked to humanity&#8217;s unchecked growth, there are nonetheless unseen factors that draw our non-terrestrial visitors to certain populations more than others, and that their focus and attention are ever more acute.</p>
<p>This series has inevitably fallen short of a full and meaningful elucidation of the techniques of Bayesian inference and Stan. From this first step on such a path, then, interested students of the bizarre and arcane would be well advised to draw on the following esoteric resources:</p>
<ul>
<li>McElreath&#8217;s <a href="https://xcelab.net/rm/statistical-rethinking/">Statistical Rethinking</a></li>
<li>Gelman et al.&#8217;s <a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data Analysis</a></li>
<li><a href="https://mc-stan.org/">Stan Manual and Tutorials</a></li>
</ul>
<p>Until then, watch the skies and archive your data.</p>
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/04/28/bayes-vs-the-invaders-part-four-convergence/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">728</post-id>	</item>
		<item>
		<title>Bayes vs. the Invaders! Part Three: The Parallax View</title>
		<link>https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/</link>
					<comments>https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/#comments</comments>
		
		<dc:creator><![CDATA[moth]]></dc:creator>
		<pubDate>Wed, 17 Apr 2019 13:35:56 +0000</pubDate>
				<category><![CDATA[beyond the veil]]></category>
		<category><![CDATA[stan]]></category>
		<category><![CDATA[ufo]]></category>
		<guid isPermaLink="false">http://www.weirddatascience.net/?p=654</guid>

					<description><![CDATA[<div class="mh-excerpt">In the <a href="http://www.weirddatascience.net/index.php/2019/04/08/bayes-vs-the-invaders-part-two-abnormal-distributions/">previous post</a> of this series unveiling the relationship between UFO sightings and population, we crossed the threshold of normality underpinning linear models to construct a <em>generalised linear model</em> based on the more theoretically satisfying Poisson distribution. On inspection, however, this model revealed itself to be less well suited to the data than we had, in our tragic ignorance, hoped.</div> <a class="mh-excerpt-more" href="https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/" title="Bayes vs. the Invaders! Part Three: The Parallax View">[...]</a>]]></description>
										<content:encoded><![CDATA[<h1>The Parallax View</h1>
<p>In the <a href="http://www.weirddatascience.net/index.php/2019/04/08/bayes-vs-the-invaders-part-two-abnormal-distributions/">previous post</a> of this series unveiling the relationship between UFO sightings and population, we crossed the threshold of normality underpinning linear models to construct a <em>generalised linear model</em> based on the more theoretically satisfying Poisson distribution.</p>
<p>On inspection, however, this model revealed itself to be less well suited to the data than we had, in our tragic ignorance, hoped. While it appeared, on visual inspection, to capture some features of the data, the predictive posterior density plot demonstrated that it still fell short of addressing the subtleties of the original.</p>
<p>In this post, we will seek to overcome this sad lack in two ways: firstly, we will subject our models to pitiless mathematical scrutiny to assess their ability to describe the data; secondly, with our eyes irrevocably opened to these techniques, we will construct an ever more complex <a href="https://www.youtube.com/watch?v=yRb63jt4uzw">armillary</a> with which to approach the unknowable truth.</p>
<h1>Critical Omissions of Information</h1>
<p>Our previous post showed how the fit of the Poisson model to the data differed from that of the simple Gaussian linear model. When presented with a grim array of potential simulacra, however, it is crucial to have reliable and quantitative mechanisms to select amongst them.</p>
<p>The eldritch procedure most suited to this purpose, <em>model selection</em>, draws, in our framework, on <em>information criteria</em> that express the <em>relative</em> effectiveness of models at creating sad mockeries of the original data. The original and most well-known such criterion is the <a href="https://en.wikipedia.org/wiki/Akaike_information_criterion"><em>Akaike Information Criterion</em></a>, which has, in turn, spawned a multitude of successors applicable in different situations and with different properties. Here, we will make use of <a href="https://mc-stan.org/loo/"><em>Leave-One-Out Cross Validation</em></a> (LOO-CV)<span id='easy-footnote-17-654' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/#easy-footnote-bottom-17-654' title='Specifically, leave-one-out cross validation calculated via &lt;a href=&quot;https://arxiv.org/abs/1507.04544&quot;&gt;&lt;em&gt;Pareto-smoothed importance sampling&lt;/em&gt;&lt;/a&gt;.'><sup>17</sup></a></span> as the most applicable to our style of model and set of techniques.</p>
<p>It is important to reiterate that these approaches do not speak to an absolute underlying truth; information criteria allow us to choose between models, assessing which has most closely assimilated the madness and chaos of the data. For LOO-CV, this results in an <em>expected log predictive density</em> (<code>elpd</code>) for each model. The model with the highest <code>elpd</code> is the least-warped mirror of reality amongst those we subject to scrutiny.</p>
<p>There are many fragile subtleties to model selection, of which we will mention only two here. Firstly, in general, the greater the number of predictors or variables incorporated into a model, the more closely it will be able to mimic the original data. This is problematic, in that a model can become <em>overfit</em> to the original data and thus be unable to represent previously unseen data accurately &#8212; it learns to mimic the form of the observed data at the expense of uncovering its underlying reality. The LOO-CV technique avoids this trap by, in effect, withholding data from the model to assess its ability to make accurate inferences on previously unseen data.</p>
<p>The second consideration in model selection is that the information criteria scores of models, such as <code>elpd</code> in LOO-CV, are subject to <em>standard error</em> in their assessment; the score itself is not a perfect metric of model performance, but a cunning approximation. As such, we will only consider one model to have outperformed its competitors if the difference in their relative <code>elpd</code> is several times greater than its standard error.</p>
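<p>This decision rule can be rendered, tentatively, in code. The sketch below assumes two <code>loo</code> objects computed as in the snippets later in this post; the threshold of four standard errors is an illustrative convention rather than a law of nature:</p>
[code language=&#8221;R&#8221;]
# Sketch of the model selection rule described above, assuming two<br />
# loo objects (loo_a, loo_b) computed via the loo package as below.<br />
comparison &lt;- compare( loo_a, loo_b )   # Yields elpd_diff and its se<br />
<br />
elpd_diff &lt;- comparison[ &quot;elpd_diff&quot; ]<br />
se_diff   &lt;- comparison[ &quot;se&quot; ]<br />
<br />
# Favour the second model only if the (positive) difference dwarfs<br />
# its standard error; 4 is an illustrative cutoff.<br />
elpd_diff &gt; 4 * se_diff
[/code]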
<p>With this understanding in hand, we can now ruthlessly quantify the effectiveness of the Gaussian linear model against the Poisson generalised linear model.</p>
<h1>Gaussian vs. the Poisson</h1>
<p>The original model presented before our subsequent descent into horror was a simple linear Gaussian, produced through use of <code>ggplot2</code>&#8217;s <code>geom_smooth</code> function. To compare this meaningfully against the Poisson model of the previous post, we must now recreate this model using the now hideously familiar tools of Bayesian modelling with Stan.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show Gaussian model specification code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>population_model_normal.stan</code><br />
[code language=&#8221;c&#8221;]
<p>data {</p>
<p>	// Number of rows (observations)<br />
	int&lt;lower=1&gt; observations;</p>
<p>	// Predictor (population of state)<br />
	vector[ observations ] population;</p>
<p>	// Response (counts)<br />
	real&lt;lower=0&gt; counts[observations];</p>
<p>}</p>
<p>parameters {</p>
<p>	// Intercept<br />
	real&lt; lower=0 &gt; a;</p>
<p>	// Slope<br />
	real&lt; lower=0 &gt; b;</p>
<p>	// Standard deviation<br />
	real&lt; lower=0 &gt; sigma;<br />
}</p>
<p>model {</p>
<p>	// Priors<br />
	a ~ normal( 0, 5 );<br />
	b ~ normal( 0, 5 );<br />
	sigma ~ cauchy( 0, 2.5 );</p>
<p>	// Model<br />
	counts ~ normal( a + population * b, sigma );</p>
<p>}</p>
<p>generated quantities {</p>
<p>	// Posterior predictions<br />
	vector[observations] counts_pred;</p>
<p>	// Log likelihood (for LOO)<br />
	vector[observations] log_lik;</p>
<p>	for (n in 1:observations) {</p>
<p>		log_lik[n] = normal_lpdf( counts[n] | a + population[n]*b, sigma );<br />
		counts_pred[n] = normal_rng( a + population[n]*b, sigma );</p>
<p>	}</p>
<p>}</p>
[/code]
</div></div>
</div>
<p>With both models straining in their different directions towards the light, we apply LOO-CV to assess their effectiveness at predicting the data.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show LOO-CV comparison code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p>LOO-CV R-code snippet. Full code is at the end of this post.<br />
[code language=&#8221;R&#8221;]
<p>&#8230;<br />
# Compare models with LOO<br />
log_lik_normal &lt;- extract_log_lik(fit_ufo_pop_normal, merge_chains = FALSE)<br />
r_eff_normal &lt;- relative_eff(exp(log_lik_normal))<br />
loo_normal &lt;- loo(log_lik_normal, r_eff = r_eff_normal, cores = 2)</p>
<p>log_lik_poisson &lt;- extract_log_lik(fit_ufo_pop_poisson, merge_chains = FALSE)<br />
r_eff_poisson &lt;- relative_eff(exp(log_lik_poisson))<br />
loo_poisson &lt;- loo(log_lik_poisson, r_eff = r_eff_poisson, cores = 2)<br />
&#8230;</p>
[/code]
</div></div>
</div>
<pre class="brush: plain; title: ; notranslate">
&gt; compare( loo_normal, loo_poisson )
elpd_diff        se 
  -8576.1     712.5 
</pre>
<p>The information criterion shows that the complexity of the Poisson model does not, in fact, produce a more effective model than the false serenity of the Gaussian<span id='easy-footnote-18-654' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/#easy-footnote-bottom-18-654' title='This is perhaps not entirely surprising, based on our belief from the first post of this series that we should consider sightings at the state level rather than globally. With over fifty individual count processes fused into an amorphous mass, it is not entirely surprising that the &lt;a href=&quot;https://en.wikipedia.org/wiki/Central_limit_theorem&quot;&gt;Gaussian is a better approximation to the data&lt;/a&gt;.'><sup>18</sup></a></span>. The negative <code>elpd_diff</code> reported by the <code>compare</code> function supports the first of the two models, and its magnitude, over twelve times the standard error, leaves little doubt that the difference is significant. We must, it seems, look further.</p>
<p>With these techniques for selecting between models in hand, then, we can move on to constructing ever more complex attempts to dispel the darkness.</p>
<h1>Trials without End</h1>
<p>The Poisson distribution, whilst appropriate for many forms of count data, suffers from fundamental limits to its understanding. The single parameter of the Poisson, &#92;(\lambda&#92;), enforces that the mean and variance of the data are equal. When such comforting falsehoods wither in the pale light of reality, we must move beyond the gentle chains in which the Poisson binds us.</p>
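<p>Before abandoning the Poisson entirely, we can glimpse its inadequacy directly in the raw data. A crude sketch, assuming the <code>ufo_population_sightings</code> data frame used in the code later in this post; note that this marginal check conflates between-state variation with true overdispersion, and is offered only as a first portent:</p>
[code language=&#8221;R&#8221;]
# Under a Poisson model, the variance of the counts should roughly<br />
# equal their mean.<br />
mean( ufo_population_sightings$count )<br />
var( ufo_population_sightings$count )<br />
<br />
# A ratio far above 1 betrays overdispersion beyond the reach of the Poisson.<br />
var( ufo_population_sightings$count ) / mean( ufo_population_sightings$count )
[/code]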
<p>The next horrific evolution, then, is the <a href="https://en.wikipedia.org/wiki/Negative_binomial_distribution"><em>negative binomial</em></a> distribution, which similarly speaks to count data, but presents a <em>dispersion</em> parameter (&#92;(\phi&#92;)) that allows the variance to exceed the mean<span id='easy-footnote-19-654' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/#easy-footnote-bottom-19-654' title='There are several parameterisations of the negative binomial used for different applications. The distribution itself is often characterised as representing the number of binary trials, such as a coin flip, required before a given result, such as a number of heads, is achieved. The parameterisation we use here is a common way to represent &lt;em&gt;overdispersed&lt;/em&gt; Poisson data.'><sup>19</sup></a></span>.</p>
<p>With our arcane theoretical library suitably expanded, we can now transplant the still-beating Poisson heart of our earlier generalised linear model with the more complex machinery of the negative binomial:</p>

$$\begin{eqnarray}<br />
y &amp;\sim&amp; \mathbf{NegBinomial}(\mu, \phi)&#92;&#92;<br />
\log(\mu) &amp;=&amp; \alpha + \beta x&#92;&#92;<br />
\alpha &amp;\sim&amp; \mathcal{N}(0, 1)&#92;&#92;<br />
\beta &amp;\sim&amp; \mathcal{N}(0, 1)&#92;&#92;<br />
\phi &amp;\sim&amp; \mathbf{HalfCauchy}(2)<br />
\end{eqnarray}$$</p>
<p>As with the Poisson, our negative binomial generalised linear model employs a log link function to transform the linear predictor. The Stan code for this model is given below.</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show negative binomial model specification code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>population_model_negbinomial.stan</code><br />
[code language=&#8221;c&#8221;]
<p>data {</p>
<p>	// Number of rows (observations)<br />
	int&lt;lower=1&gt; observations;</p>
<p>	// Predictor (population of state)<br />
	vector[ observations ] population_raw;</p>
<p>	// Response (counts)<br />
	int&lt;lower=0&gt; counts[observations];</p>
<p>}</p>
<p>transformed data {</p>
<p>	// Center and scale the predictor<br />
	vector[ observations ] population;<br />
	population = ( population_raw - mean( population_raw ) ) / sd( population_raw );</p>
<p>}</p>
<p>parameters {</p>
<p>	// Negative binomial dispersion parameter<br />
	real&lt;lower=0&gt; phi;</p>
<p>	// Intercept<br />
	real a;</p>
<p>	// Slope<br />
	real b;</p>
<p>}</p>
<p>transformed parameters {</p>
<p>	// Linear predictor on the log scale (may legitimately be negative)<br />
	vector[observations] mu;<br />
	mu = a + b*population;</p>
<p>}</p>
<p>model {</p>
<p>	// Priors<br />
	a ~ normal( 0, 1 );<br />
	b ~ normal( 0, 1 );<br />
	phi ~ cauchy( 0, 5 );</p>
<p>	// Model<br />
	// Uses the log version of the neg_binomial_2 to avoid<br />
	// manual exponentiation of the linear predictor.<br />
	// (This avoids numerical problems in the calculations.)<br />
	counts ~ neg_binomial_2_log( mu, phi );</p>
<p>}</p>
<p>generated quantities {</p>
<p>	vector[observations] counts_pred;<br />
	vector[observations] log_lik;</p>
<p>	for (n in 1:observations) {</p>
<p>		log_lik[n] = neg_binomial_2_log_lpmf( counts[n] | mu[n], phi );<br />
		counts_pred[n] = neg_binomial_2_log_rng( mu[n], phi );</p>
<p>	}</p>
<p>}</p>
[/code]
</div></div>
</div>
<p>With this model fit, we can compare its whispered falsehoods against both the original linear Gaussian model and the Poisson GLM:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show LOO-CV comparison code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p>Code snippet for calculating the LOO-CV <code>elpd</code> for three models. The full R code for building and comparing all models is listed at the end of this post.</p>
[code language=&quot;R&quot;]
&#8230;<br />
# Compare models with LOO<br />
log_lik_normal &lt;- extract_log_lik(fit_ufo_pop_normal, merge_chains = FALSE)<br />
r_eff_normal &lt;- relative_eff(exp(log_lik_normal))<br />
loo_normal &lt;- loo(log_lik_normal, r_eff = r_eff_normal, cores = 2)</p>
<p>log_lik_poisson &lt;- extract_log_lik(fit_ufo_pop_poisson, merge_chains = FALSE)<br />
r_eff_poisson &lt;- relative_eff(exp(log_lik_poisson))<br />
loo_poisson &lt;- loo(log_lik_poisson, r_eff = r_eff_poisson, cores = 2)</p>
<p>log_lik_negbinom &lt;- extract_log_lik(fit_ufo_pop_negbinom, merge_chains = FALSE)<br />
r_eff_negbinom &lt;- relative_eff(exp(log_lik_negbinom))<br />
loo_negbinom &lt;- loo(log_lik_negbinom, r_eff = r_eff_negbinom, cores = 2)<br />
&#8230;<br />
[/code]
</div></div>
</div>
<pre class="brush: plain; title: ; notranslate">
&gt; compare( loo_poisson, loo_negbinom )
elpd_diff        se 
   8880.8     721.9 
</pre>
<p>With the first comparison, it is clear that the sinuous flexibility offered by the dispersion parameter, &#92;(\phi&#92;), of the negative binomial allows that model to mould itself much more effectively to the data than the Poisson. The <code>elpd_diff</code> score is positive, indicating that the second of the two compared models is favoured; the difference is over twelve times the standard error, giving us confidence that the negative binomial model is meaningfully more effective than the Poisson.</p>
<p>Whilst superior to the Poisson, does this adaptive capacity allow the negative binomial model to render the na&iuml;ve Gaussian linear model obsolete?</p>
<pre class="brush: plain; title: ; notranslate">
&gt; compare( loo_normal, loo_negbinom )
elpd_diff        se 
    304.7      30.9 
</pre>
<p>The negative binomial model subsumes the Gaussian with little effort. The <code>elpd_diff</code> is almost ten times the standard error in favour of the negative binomial GLM, giving us confidence in choosing it. From here on, we will rely on the negative binomial as the core of our schemes.</p>
<h1>Overlapping Realities</h1>
<p>The improvements we have seen with the negative binomial model allow us to discard the Gaussian and Poisson models with confidence. It is not, however, sufficient to fill the gaping void induced by our belief that the sightings of abnormal aerial phenomena in differing US states vary differently with their human population.</p>
<p>To address this question we must ascertain whether allowing our models to unpick the individual influence of states will improve their predictive ability. This, in turn, will lead us into the gnostic insanity of <em>hierarchical models</em>, in which we group predictors in our models to account for their shadowy underlying structures.</p>
<h1>Limpid Pools</h1>
<p>The first step on this path is to allow part of the linear function underpinning our model, specifically the intercept value, &#92;(\alpha&#92;), to vary between different US states. In a simple linear model, this causes the line of best fit for each state to meet the y-axis at a different point, whilst maintaining a constant slope for all states. In such a model, the result is a set of parallel lines of fit, rather than a single global truth.</p>
<p>This varying intercept can describe a range of possible phenomena for which the rate of change remains constant, but the baseline value varies. In such <em>hierarchical models</em> we employ a concept known as <em>partial pooling</em> to extract as much forbidden knowledge from the reluctant data as possible.</p>
<p>A set of entirely separate models, such as the per-state set of linear regressions presented in the first post of this series, employs a <em>no pooling</em> approach: the data of each state is treated separately, with an entirely different model fit to each. This certainly considers the uniqueness of each state, but cannot benefit from insights drawn from the broader range of data we have available, which we may reasonably assume to have some relevance.</p>
<p>By contrast, the global Gaussian, Poisson, and negative binomial models presented so far represent <em>complete pooling</em>, in which the entire set of data is considered a formless, protean amalgam without meaningful structure. This mindless, groping approach causes the unique features of each state to be lost amongst the anarchy and chaos.</p>
<p>A partial pooling approach instead builds a <em>global</em> mean intercept value across the dataset, but allows the intercept value for each individual state to deviate according to a governing probability distribution. This accounts for the individuality of each group of observations, in our case the state, whilst also drawing on the accumulated wisdom of the whole.</p>
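<p>The essence of partial pooling can be glimpsed in a toy, purely illustrative calculation: for Gaussian quantities, the partially-pooled estimate for a group is a precision-weighted compromise between that group&#8217;s own mean and the global mean. All numbers below are hypothetical:</p>
[code language=&#8221;R&#8221;]
# Toy illustration of partial pooling; hypothetical numbers throughout.<br />
global_mean   &lt;- 100   # Complete-pooling estimate across all states<br />
state_mean    &lt;- 160   # No-pooling estimate for one data-poor state<br />
n             &lt;- 5     # Observations for that state<br />
sigma_within  &lt;- 40    # Within-state standard deviation<br />
sigma_between &lt;- 20    # Between-state standard deviation<br />
<br />
# Precision-weighted shrinkage towards the global mean:<br />
w &lt;- ( n / sigma_within^2 ) / ( n / sigma_within^2 + 1 / sigma_between^2 )<br />
w * state_mean + ( 1 - w ) * global_mean   # Lies between the two extremes
[/code]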
<p>We now construct a partially-pooled varying intercept model, in which the parameters and observations for each US state in our dataset are individually indexed:</p>
<p>$$\begin{eqnarray}<br />
y &amp;\sim&amp; \mathbf{NegBinomial}(\mu, \phi)&#92;&#92;<br />
\log(\mu) &amp;=&amp; \alpha_i + \beta x&#92;&#92;<br />
\alpha_i &amp;\sim&amp; \mathcal{N}(\mu_\alpha, \sigma_\alpha)&#92;&#92;<br />
\beta &amp;\sim&amp; \mathcal{N}(0, 1)&#92;&#92;<br />
\phi &amp;\sim&amp; \mathbf{HalfCauchy}(2)<br />
\end{eqnarray}$$</p>
<p>Note that the intercept parameter, &#92;(\alpha&#92;), in the second line is now indexed by the state, represented here by the subscript &#92;(i&#92;). The slope parameter, &#92;(\beta&#92;), remains constant across all states.</p>
<p>This model can be rendered in Stan code as follows:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show Gaussian model specification code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>population_model_negbinomial_var_intercept.stan</code><br />
[code language=&#8221;c&#8221;]
<p>data {</p>
<p>	// Number of rows (observations)<br />
	int&lt;lower=1&gt; observations;</p>
<p>	// Number of states<br />
	int&lt; lower=0 &gt; states;</p>
<p>	// Vector detailing the US state in which each observation (count of<br />
	// counts in a year) occurred<br />
	int&lt; lower=1, upper=states &gt; state[ observations ];</p>
<p>	// Predictor (population of state)<br />
	vector[ observations ] population_raw;</p>
<p>	// Response (counts)<br />
	int&lt;lower=0&gt; counts[ observations ];</p>
<p>}</p>
<p>transformed data {</p>
<p>	// Center and scale the predictor<br />
	vector[ observations ] population;<br />
	population = ( population_raw - mean( population_raw ) ) / sd( population_raw );</p>
<p>}</p>
<p>parameters {</p>
<p>	// Per-state intercepts<br />
	vector[ states ] a;</p>
<p>	// Mean and SD of distribution from which per-state intercepts are drawn<br />
	real&lt; lower=0 &gt; mu_a;<br />
	real&lt; lower=0 &gt; sigma_a;</p>
<p>	// Negative binomial dispersion parameter<br />
	real&lt; lower=0 &gt; phi;</p>
<p>	// Slope<br />
	real b;</p>
<p>}</p>
<p>transformed parameters {</p>
<p>	// Calculate location parameter for negative binomial incorporating<br />
	// per-state indicator.<br />
	vector[ observations ] eta;</p>
<p>	for( i in 1:observations ) {<br />
		eta[i] = a[ state[i] ] + population[i] * b;<br />
	}<br />
}</p>
<p>model {</p>
<p>	mu_a ~ normal(0, 1);<br />
	sigma_a ~ cauchy(0, 2);</p>
<p>	// Priors<br />
	a ~ normal ( mu_a, sigma_a );<br />
	b ~ normal( 0, 1 );<br />
	phi ~ cauchy( 0, 2 );</p>
<p>	// Model<br />
	counts ~ neg_binomial_2_log( eta, phi );</p>
<p>}</p>
<p>generated quantities {</p>
<p>	vector[observations] counts_pred;<br />
	vector[observations] log_lik;</p>
<p>	vector[observations] mu;<br />
	mu = exp( eta );</p>
<p>	for (n in 1:observations) {</p>
<p>		log_lik[n] = neg_binomial_2_log_lpmf( counts[n] | eta[n], phi );<br />
		counts_pred[n] = neg_binomial_2_log_rng( eta[n], phi );</p>
<p>	}</p>
<p>}</p>
[/code]
</div></div>
</div>
<p>Once the model has twisted itself into the most appropriate form for our data, we can now compare it against our previous completely-pooled model:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show LOO-CV comparison code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p>Code snippet for comparing models via LOO-CV. Full code at the end of this post.</p>
[code language=&#8221;R&#8221;]
&#8230;<br />
# Compare models with LOO<br />
log_lik_negbinom_var_intercept &lt;- extract_log_lik(fit_ufo_pop_negbinom_var_intercept, merge_chains = FALSE)<br />
r_eff_negbinom_var_intercept &lt;- relative_eff(exp(log_lik_negbinom_var_intercept))<br />
loo_negbinom_var_intercept &lt;- loo(log_lik_negbinom_var_intercept, r_eff = r_eff_negbinom_var_intercept, cores = 2)<br />
&#8230;<br />
[/code]
</div></div>
</div>
<pre class="brush: plain; title: ; notranslate">
&gt; compare( loo_negbinom, loo_negbinom_var_intercept )
elpd_diff        se 
    363.2      28.8 
</pre>
<p>Our transcendent journey from the statistical primordial ooze continues: the varying intercept model is favoured over the completely-pooled model by a significant margin.</p>
<h1>Sacred Geometry</h1>
<p>Now that our minds have apprehended a startling glimpse of the implications of the varying intercept model, it is natural to consider taking a further terrible step and allowing both the slope and the intercept to vary<span id='easy-footnote-20-654' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/#easy-footnote-bottom-20-654' title='One can build varying slope models with a fixed intercept, but we will not approach that horror here.'><sup>20</sup></a></span>.</p>
<p>With both the intercept and slope of the underlying linear predictor varying, an additional complexity raises its head: can we safely assume that these parameters, the intercept and slope, vary independently of each other, or may there be arcane correlations between them? Do states with a higher intercept also experience a higher slope in general, or is the opposite the case? Without prior knowledge to the contrary, we must allow our model to determine these possible correlations, or we are needlessly throwing away potential information in our model.</p>
<p>For a varying slope and intercept model, therefore, we must now include a <em>correlation matrix</em>, &#92;(\Omega&#92;), between the parameters of the linear predictor for each state in our model. This correlation matrix, as with all parameters in a Bayesian framework, must be expressed with a prior distribution from which the model can begin its evaluation of the data.</p>
<p>With deference to the authoritative <a href="https://mc-stan.org/docs/2_18/stan-users-guide/">quaint and curious volume of forgotten lore</a> we will use an <a href="https://mc-stan.org/docs/2_18/stan-users-guide/multivariate-hierarchical-priors-section.html">LKJ prior</a> for the correlation matrix without further discussion of the reasoning behind it.</p>
<p>$$\begin{eqnarray}<br />
y &amp;\sim&amp; \mathbf{NegBinomial}(\mu, \phi)&#92;&#92;<br />
\log(\mu) &amp;=&amp; \alpha_i + \beta_i x&#92;&#92;<br />
\begin{bmatrix}<br />
\alpha_i&#92;&#92;<br />
\beta_i<br />
\end{bmatrix} &amp;\sim&amp; \mathcal{N}(<br />
\begin{bmatrix}<br />
\mu_\alpha&#92;&#92;<br />
\mu_\beta<br />
\end{bmatrix}, \Sigma )&#92;&#92;<br />
\Sigma &amp;=&amp; \operatorname{diag}(\sigma) \, \Omega \, \operatorname{diag}(\sigma)&#92;&#92;<br />
\Omega &amp;\sim&amp; \mathbf{LKJCorr}(2)&#92;&#92;<br />
\phi &amp;\sim&amp; \mathbf{HalfCauchy}(2)<br />
\end{eqnarray}$$</p>
<p>This model has grown and gained a somewhat twisted complexity compared with the serene austerity of our earliest linear model. Despite this, each further step in the descent has followed its own perverse logic, and the progression should be clear. The corresponding Stan code follows:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show negative binomial varying intercept and slope model specification code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>population_model_negbinomial_var_intercept_slope.stan</code><br />
[code language=&#8221;c&#8221;]
<p>data {</p>
<p>	// Number of rows (observations)<br />
	int&lt;lower=1&gt; observations;</p>
<p>	// Number of states<br />
	int&lt; lower=0 &gt; states;</p>
<p>	// Vector detailing the US state in which each observation (count of<br />
	// counts in a year) occurred<br />
	int&lt; lower=1, upper=states &gt; state[ observations ];</p>
<p>	// Predictor (population of state)<br />
	vector[ observations ] population_raw;</p>
<p>	// Response (counts)<br />
	int&lt;lower=0&gt; counts[ observations ];</p>
<p>}</p>
<p>transformed data {</p>
<p>	// Center and scale the predictor<br />
	vector[ observations ] population;<br />
	population = ( population_raw - mean( population_raw ) ) / sd( population_raw );</p>
<p>}</p>
<p>parameters {</p>
<p>	// Per-state intercepts and slopes<br />
	vector[ states ] state_intercept;<br />
	vector[ states ] state_slope;</p>
<p>	// Baseline intercept and slope from which each group deviates.<br />
	real pop_intercept;<br />
	real pop_slope;</p>
<p>	// Standard deviations of the distributions from which per-state intercepts and slopes are drawn<br />
	vector&lt; lower=0 &gt;[2] state_sigma;</p>
<p>	// Negative binomial dispersion parameter<br />
	real&lt; lower=0 &gt; phi;</p>
<p>	// Parameter correlation matrix<br />
	corr_matrix[2] omega;</p>
<p>}</p>
<p>transformed parameters {</p>
<p>	vector[2] vec_intercept_slope[ states ];<br />
	vector[2] mu_intercept_slope;</p>
<p>	// Location parameter<br />
	vector[observations] eta;</p>
<p>	// Per-state intercepts and slopes<br />
	for( i in 1:states ) {</p>
<p>		vec_intercept_slope[ i, 1] = state_intercept[i];<br />
		vec_intercept_slope[ i, 2] = state_slope[i];</p>
<p>	}</p>
<p>	// Population slope and intercept<br />
	mu_intercept_slope[1] = pop_intercept;<br />
	mu_intercept_slope[2] = pop_slope;</p>
<p>	// Calculate the negative binomial location parameter<br />
	for( i in 1:observations ) {<br />
		eta[i] = state_intercept[ state[i] ] + state_slope[ state[i]] * population[ i ];<br />
	}</p>
<p>}</p>
<p>model {</p>
<p>	// Priors<br />
	omega ~ lkj_corr(2);<br />
	phi ~ cauchy(0, 3 );<br />
	state_sigma ~ cauchy( 0, 3 );</p>
<p>	pop_intercept ~ normal( 0, 1 );<br />
	pop_slope ~ normal( 0, 1 );</p>
<p>	vec_intercept_slope ~ multi_normal( mu_intercept_slope, quad_form_diag( omega, state_sigma ) );</p>
<p>	// Model<br />
	counts ~ neg_binomial_2_log( eta, phi );</p>
<p>}</p>
<p>generated quantities {</p>
<p>	vector[observations] counts_pred;<br />
	vector[observations] log_lik;</p>
<p>	for (n in 1:observations) {</p>
<p>		log_lik[n] = neg_binomial_2_log_lpmf( counts[n] | eta[n], phi );<br />
		counts_pred[n] = neg_binomial_2_log_rng( eta[n], phi );</p>
<p>	}</p>
<p>}</p>
[/code]
</div></div>
</div>
<p>The ultimate test of our faith, then, is whether the added complexity of the partially-pooled varying slope, varying intercept model is justified. Once again, we turn to the ruthless judgement of the LOO-CV:</p>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show LOO-CV comparison code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p>Code snippet for calculating the LOO-CV <code>elpd</code>. The full R code for building and comparing all models in this post is listed at the end.</p>
[code language=&#8221;R&#8221;]
&#8230;<br />
log_lik_negbinom_var_intercept_slope &lt;- extract_log_lik(fit_ufo_pop_negbinom_var_intercept_slope, merge_chains = FALSE)<br />
r_eff_negbinom_var_intercept_slope &lt;- relative_eff(exp(log_lik_negbinom_var_intercept_slope))<br />
loo_negbinom_var_intercept_slope &lt;- loo(log_lik_negbinom_var_intercept_slope, r_eff = r_eff_negbinom_var_intercept_slope, cores = 2)<br />
&#8230;<br />
[/code]
</div></div>
</div>
<pre class="brush: plain; title: ; notranslate">
&gt; compare( loo_negbinom_var_intercept, loo_negbinom_var_intercept_slope )
elpd_diff        se 
     13.3       2.4 
</pre>
<p>In this final step we can see that our labours in the arcane have been rewarded. The final model is once again a significant improvement over its simpler relatives. Whilst the potential for deeper and more perfect models never ends, we will settle for now on this.</p>
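<p>Before proceeding, the correlation matrix &#92;(\Omega&#92;) that justified this final contortion can itself be dragged, briefly, into the light. A short sketch, using the parameter names from the Stan code above:</p>
[code language=&#8221;R&#8221;]
# Posterior draws of the intercept-slope correlation matrix, omega.<br />
omega_draws &lt;- rstan::extract( fit_ufo_pop_negbinom_var_intercept_slope,<br />
                               pars = &quot;omega&quot; )$omega<br />
<br />
# Posterior mean of the correlation between per-state intercepts and slopes:<br />
mean( omega_draws[ , 1, 2 ] )
[/code]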
<h1>Mortal Consequences</h1>
<p>With our final model built, we can now begin to examine its mortifying implications. We will leave the majority of the subjective analysis for the next, and final, post in this series. For now, however, we can reinforce our quantitative analysis with visual assessment of the posterior predictive distribution output of our final model.</p>
<figure id="attachment_701" aria-describedby="caption-attachment-701" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive.png"><img loading="lazy" decoding="async" data-attachment-id="701" data-permalink="https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/posterior_predictive-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="bvi03-posterior_predictive" data-image-description="" data-image-caption="&lt;p&gt;Posterior predictive density plot of varying intercept, varying slope negative binomial model of UFO sightings.&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-1024x576.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive.png" alt="" width="1920" height="1080" class="size-full wp-image-701" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive-64x36.png 64w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-701" class="wp-caption-text">Posterior predictive density plot of varying intercept, varying slope negative binomial GLM of UFO sightings. (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/posterior_predictive.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show posterior predictive plotting code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>bayes_plots.r</code><br />
(Includes code to generate traceplot and posterior predictive distribution plot.)<br />
[code language=&#8221;R&#8221;]
<p>library( tidyverse )<br />
library( magrittr )<br />
library( lubridate )</p>
<p>library( ggplot2 )<br />
library( showtext )<br />
library( cowplot )</p>
<p>library( rstan )<br />
library( bayesplot )<br />
library( tidybayes )</p>
<p># Load UFO data<br />
ufo_population_sightings &lt;-<br />
	readRDS(&quot;work/ufo_population_sightings.rds&quot;)</p>
<p># UFO reporting font<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/Tox Typewriter.ttf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/Tox Typewriter.ttf&quot;)<br />
showtext_auto()</p>
<p># Plots, posterior predictive checking, LOO. </p>
<p># (Visualisations only produced for varying slope/intercept model, as a result<br />
# of LOO checking.)</p>
<p># Bayesplot needs to be told which theme to use as a default.<br />
theme_set( theme_weird() )</p>
<p># Read the fitted model<br />
fit_ufo_pop_negbinom_var_intercept_slope &lt;-<br />
	readRDS( &quot;work/fit_ufo_pop_negbinom_var_intercept_slope.rds&quot; )</p>
<p>## Model checking visualisations</p>
<p># Extract posterior estimates from the fit (from the generated quantities of the stan model)<br />
counts_pred_negbinom_var_intercept_slope &lt;- as.matrix( fit_ufo_pop_negbinom_var_intercept_slope, pars = &quot;counts_pred&quot; )</p>
<p># First, as always, a traceplot<br />
tp &lt;-<br />
	traceplot(<br />
				 fit_ufo_pop_negbinom_var_intercept_slope,<br />
				 pars = c(&quot;pop_intercept&quot;, &quot;pop_slope&quot;, &quot;phi&quot; ),<br />
				 ncol=1 ) +<br />
	scale_colour_viridis_d( name=&quot;Chain&quot;, direction=-1 ) +<br />
	theme_weird()</p>
<p>title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Traceplot of Key Parameters&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40) </p>
<p>titled_tp &lt;-<br />
	plot_grid(title, tp, ncol=1, rel_heights=c(0.1, 1)) +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
			plot.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
	) </p>
<p>save_plot(&quot;output/traceplot.pdf&quot;,<br />
			 titled_tp,<br />
			 base_width = 16,<br />
			 base_height = 9,<br />
			 base_aspect_ratio = 1.78 )</p>
<p># Posterior predictive density. (Visual representation of goodness of fit.)<br />
gp_ppc &lt;-<br />
	ppc_dens_overlay(<br />
						  y = extract2( ufo_population_sightings, &quot;count&quot; ),<br />
						  yrep = counts_pred_negbinom_var_intercept_slope  ) +<br />
	theme_weird()</p>
<p>title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Posterior Predictive Density Plot&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.40) </p>
<p>titled_gp_ppc &lt;-<br />
	plot_grid(title, gp_ppc, ncol=1, rel_heights=c(0.1, 1)) +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
			plot.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
	) </p>
<p>save_plot(&quot;output/posterior_predictive.pdf&quot;,<br />
			 titled_gp_ppc,<br />
			 base_width = 16,<br />
			 base_height = 9,<br />
			 base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<p>In comparison with earlier attempts, the varying intercept and slope model visibly captures the overall shape of the distribution with terrifying ease. As our wary confidence mounts in the mindless automaton we have fashioned, we can now examine its predictive ability on our original data.</p>
<figure id="attachment_703" aria-describedby="caption-attachment-703" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png"><img loading="lazy" decoding="async" data-attachment-id="703" data-permalink="https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/predictive_plot-2/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="bvi03-predictive_plot" data-image-description="" data-image-caption="&lt;p&gt;Varying intercept and slope negative binomial GLM of UFO sightings against population.&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-1024x576.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png" alt="" width="1920" height="1080" class="size-full wp-image-703" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot-64x36.png 64w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-703" class="wp-caption-text">Varying intercept and slope negative binomial GLM of UFO sightings against population. (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/predictive_plot.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show negative binomial varying intercept and slope plot code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>population_plot.r</code><br />
[code language=&#8221;R&#8221;]
<p>library( tidyverse )<br />
library( magrittr )<br />
library( lubridate )</p>
<p>library( ggplot2 )<br />
library( showtext )<br />
library( cowplot )</p>
<p>library( rstan )<br />
library( bayesplot )<br />
library( tidybayes )<br />
library( modelr )</p>
<p># Load UFO data and model<br />
ufo_population_sightings &lt;-<br />
	readRDS(&quot;work/ufo_population_sightings.rds&quot;)</p>
<p>fit_ufo_pop_negbinom_var_intercept_slope &lt;-<br />
	readRDS(&quot;work/fit_ufo_pop_negbinom_var_intercept_slope.rds&quot;)<br />
	#readRDS(&quot;work/fit_ufo_pop_normal_var_intercept_slope.rds&quot;)</p>
<p># UFO reporting font<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/Tox Typewriter.ttf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/Tox Typewriter.ttf&quot;)<br />
showtext_auto()</p>
<p># Plots, posterior predictive checking, LOO<br />
theme_set( theme_weird() )</p>
<p>## Model checking visualisations</p>
<p># Extract posterior estimates from the fit (from the generated quantities of the stan model)<br />
counts_pred_negbinom_var_intercept_slope &lt;- as.matrix( fit_ufo_pop_negbinom_var_intercept_slope, pars = &quot;counts_pred&quot; )</p>
<p>## Create per-state predictive fit plots</p>
<p># Convert fitted model (stanfit) object to a tibble<br />
fit_tbl &lt;-<br />
	summary(fit_ufo_pop_negbinom_var_intercept_slope)$summary %&gt;%<br />
	as.data.frame() %&gt;%<br />
	mutate(variable = rownames(.)) %&gt;%<br />
	select(variable, everything()) %&gt;%<br />
	as_tibble()</p>
<p>counts_predicted &lt;-<br />
	fit_tbl %&gt;%<br />
	filter( str_detect( variable, &quot;counts_pred&quot; ) ) </p>
<p>ufo_population_sightings_pred &lt;-<br />
	ufo_population_sightings %&gt;%<br />
	ungroup() %&gt;%<br />
	mutate( count_mean = counts_predicted$mean,<br />
			 lower = counts_predicted$`25%`,<br />
			 upper = counts_predicted$`75%`) </p>
<p># (Using mean and SD of fit summary)<br />
predictive_plot &lt;-<br />
	ggplot( ufo_population_sightings_pred ) +<br />
	geom_point( aes( x=population, y=count, colour=state ), size=0.6, alpha=0.8 ) +<br />
	geom_line(aes( x=population, y=count_mean, colour=state )) +<br />
	geom_ribbon(aes(x=population, ymin = lower, ymax = upper, fill=state), alpha = 0.25) +<br />
	labs( x=&quot;Population (Thousands)&quot;, y=&quot;Annual Sightings&quot; ) +<br />
	scale_fill_viridis_d( name=&quot;State&quot; ) +<br />
	scale_colour_viridis_d( name=&quot;State&quot; ) +<br />
	theme(<br />
			axis.title.y = element_text( angle=90 ),<br />
			legend.position = &quot;none&quot; )</p>
<p># Construct full plot, with title and backdrop.<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;UFO Sightings against State Population (1990-2014)&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;Negative Binomial Hierarchical GLM. Varying slope and intercept. 50% credible intervals.&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.48) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.16) </p>
<p>data_label &lt;- ggdraw() +<br />
	draw_label(&quot;Data: http://www.nuforc.org | Tool: http://www.mc-stan.org&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=1, x=0.98 ) </p>
<p>predictive_plot_titled &lt;-<br />
	plot_grid(title, predictive_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +<br />
	theme(<br />
			panel.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
			plot.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
	) </p>
<p>save_plot(&quot;output/predictive_plot.pdf&quot;,<br />
			 predictive_plot_titled,<br />
			 base_width = 16,<br />
			 base_height = 9,<br />
			 base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<p>The purpose of our endeavours is to show whether or not the frequency of extraterrestrial visitations is merely a sad reflection of the number of unsuspecting humans living in each state. After seemingly endless cryptic calculations, our statistical machinery implies that there are deeper mysteries here: allowing the relationship between sightings and the underlying linear predictors to vary by state more perfectly predicts the data. There are clearly other, hidden, factors in play.</p>
<p>More than that, however, our final model allows us to quantify these differences. We can now retrieve from the very bowels of our inferential process the per-state distribution of paremeters for both the slope and intercept of the linear predictor.</p>
<figure id="attachment_705" aria-describedby="caption-attachment-705" style="width: 1920px" class="wp-caption aligncenter"><a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png"><img loading="lazy" decoding="async" data-attachment-id="705" data-permalink="https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/ufo_per-state_intercepts-slopes/" data-orig-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ufo_per-state_intercepts-slopes" data-image-description="" data-image-caption="&lt;p&gt;Varying slope and intercept negative binomial GLM parameter plot.&lt;/p&gt;
" data-medium-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-300x169.png" data-large-file="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-1024x576.png" src="http://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png" alt="" width="1920" height="1080" class="size-full wp-image-705" srcset="https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.png 1920w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-640x360.png 640w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-300x169.png 300w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-768x432.png 768w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-1024x576.png 1024w, https://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes-64x36.png 64w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /></a><figcaption id="caption-attachment-705" class="wp-caption-text">Varying slope and intercept negative binomial GLM parameter plot for UFO sightings model. (<a href="http://www.weirddatascience.net/wp-content/uploads/2019/04/ufo_per-state_intercepts-slopes.pdf">PDF Version</a>)</figcaption></figure>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show per-state intercept and slope plotting code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p><code>slope_intercept_plot.r</code><br />
[code language=&#8221;R&#8221;]
<p>library( tidyverse )<br />
library( magrittr )<br />
library( lubridate )</p>
<p>library( ggplot2 )<br />
library( showtext )<br />
library( cowplot )</p>
<p>library( rstan )<br />
library( bayesplot )<br />
library( tidybayes )<br />
library( modelr )</p>
<p># Load UFO data and model<br />
ufo_population_sightings &lt;-<br />
	readRDS(&quot;work/ufo_population_sightings.rds&quot;)</p>
<p>fit_ufo_pop_negbinom_var_intercept_slope &lt;-<br />
	readRDS(&quot;work/fit_ufo_pop_negbinom_var_intercept_slope.rds&quot;)<br />
	#readRDS(&quot;work/fit_ufo_pop_normal_var_intercept_slope.rds&quot;)</p>
<p># UFO reporting font<br />
font_add( &quot;main_font&quot;, &quot;/usr/share/fonts/TTF/weird/Tox Typewriter.ttf&quot;)<br />
font_add( &quot;bold_font&quot;, &quot;/usr/share/fonts/TTF/weird/Tox Typewriter.ttf&quot;)<br />
showtext_auto()</p>
<p># Plots, posterior predictive checking, LOO<br />
theme_set( theme_weird() )</p>
<p># Use teal colour scheme<br />
color_scheme_set( &quot;teal&quot;)</p>
<p>## Model checking visualisations</p>
<p># Extract posterior estimates from the fit (from the generated quantities of the stan model)<br />
counts_pred_negbinom_var_intercept_slope &lt;- as.matrix( fit_ufo_pop_negbinom_var_intercept_slope, pars = &quot;counts_pred&quot; )</p>
<p># US state data<br />
us_state_factors &lt;-<br />
	levels( factor( ufo_population_sightings$state ) )</p>
<p># US state names for nice plotting<br />
# Data: &lt;https://www.50states.com/abbreviations.htm&gt;<br />
state_code_data &lt;-<br />
	read_csv( file=&quot;data/us_states.csv&quot; ) %&gt;%<br />
	filter( code %in% us_state_factors ) </p>
<p># Rename variables back to state names<br />
posterior_intercepts &lt;-<br />
	as.data.frame( fit_ufo_pop_negbinom_var_intercept_slope ) %&gt;%<br />
	as_tibble %&gt;%<br />
	select(starts_with(&#8216;state_intercept&#8217;) ) %&gt;%<br />
	rename_all( ~us_state_factors ) %&gt;%<br />
	rename_all( ~extract2( state_code_data, &quot;us_state&quot; ) )</p>
<p># Rename variables back to state names<br />
posterior_slopes &lt;-<br />
	as.data.frame( fit_ufo_pop_negbinom_var_intercept_slope ) %&gt;%<br />
	as_tibble %&gt;%<br />
	select(starts_with(&#8216;state_slope&#8217;) ) %&gt;%<br />
	rename_all( ~us_state_factors ) %&gt;%<br />
	rename_all( ~extract2( state_code_data, &quot;us_state&quot; ) )</p>
<p># Posterior draws combined<br />
posterior_slopes_long &lt;-<br />
	posterior_slopes %&gt;%<br />
	gather( value = &quot;slope&quot; )</p>
<p>posterior_intercepts_long &lt;-<br />
	posterior_intercepts %&gt;%<br />
	gather( value = &quot;intercept&quot; )</p>
<p>posterior_draws_long &lt;-<br />
	bind_cols( posterior_intercepts_long, posterior_slopes_long ) %&gt;%<br />
	select( -key1 ) %&gt;%<br />
	transmute( state = key, intercept, slope )</p>
<p># Interval plots (slope and intervals)<br />
# Plot intercept parameters for varying intercept and slope model<br />
gp_intercept &lt;-<br />
	mcmc_intervals( posterior_intercepts ) +<br />
	ggtitle( &quot;Intercepts&quot; ) +<br />
	theme_weird() </p>
<p># Plot slope parameters for varying intercept and slope model<br />
# (Remove y-axis labels as this will be aligned with the intercept plot.)<br />
gp_slope &lt;-<br />
	mcmc_intervals( posterior_slopes ) +<br />
	ggtitle( &quot;Slopes&quot; ) +<br />
	theme_weird() +<br />
	theme(<br />
			axis.text.y = element_blank()<br />
			)</p>
<p>gp_slope_intercept &lt;-<br />
	plot_grid( gp_intercept, gp_slope, ncol=2 )</p>
<p># Construct full plot, with title and backdrop.<br />
title &lt;-<br />
	ggdraw() +<br />
	draw_label(&quot;Per-State UFO Intercepts and Slopes&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=20, hjust=0, vjust=1, x=0.02, y=0.88) +<br />
	draw_label(&quot;Mean value, 50% credible interval, and 95% credible interval shown.&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.48) +<br />
	draw_label(&quot;http://www.weirddatascience.net | @WeirdDataSci&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=0, vjust=1, x=0.02, y=0.16) </p>
<p>data_label &lt;- ggdraw() +<br />
	draw_label(&quot;Data: http://www.nuforc.org | Tool: http://www.mc-stan.org&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=1, x=0.98 ) </p>
<p>data_label &lt;- ggdraw() +<br />
	draw_label(&quot;Data: http://www.nuforc.org | Tool: http://www.mc-stan.org&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=1, x=0.98 ) </p>
<p>data_label &lt;- ggdraw() +<br />
	draw_label(&quot;Data: http://www.nuforc.org | Tool: http://www.mc-stan.org&quot;, fontfamily=&quot;main_font&quot;, colour = &quot;#cccccc&quot;, size=12, hjust=1, x=0.98 ) </p>
<p>gp_slope_intercept_titled &lt;-<br />
	plot_grid(title, gp_slope_intercept, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +<br />
	theme( panel.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;),<br />
			plot.background = element_rect(fill = &quot;#222222&quot;, colour = &quot;#222222&quot;)) </p>
<p>save_plot(&quot;output/ufo_per-state_intercepts-slopes.pdf&quot;,<br />
			 gp_slope_intercept_titled,<br />
			 base_width = 16,<br />
			 base_height = 9,<br />
			 base_aspect_ratio = 1.78 )</p>
[/code]
</div></div>
</div>
<p>It is important to note that, while we are still referring to the &#92;(\alpha&#92;) and &#92;(\beta&#92;) parameters as the slope and intercept, their interpretation is more complex in a generalised linear model with a &#92;(\log&#92;) link function than in the simple linear model. For now, however, this diagram is sufficient to show that the horror visited on innocent lives by our interstellar visitors is not purely arbitrary, but depends at least in part on geographical location.</p>
<p>With this malign inferential process finally complete we will turn, in the next post, to a trembling interpretation of the model and its dark implications for our collective future.</p>
<h2>Model Fitting and Comparison Code Listing</h2>
<div class="su-accordion su-u-trim">
<div class="su-spoiler su-spoiler-style-fancy su-spoiler-icon-chevron su-spoiler-closed" data-scroll-offset="0" data-anchor-in-url="no"><div class="su-spoiler-title" tabindex="0" role="button"><span class="su-spoiler-icon"></span>Show full model fitting and LOO-CV comparison code.</div><div class="su-spoiler-content su-u-clearfix su-u-trim">
<p>This code fits the range of models developed in this series, relying on the individual Stan source code files, and runs the LOO-CV comparisons discussed in this post.</p>
<p><code>population_model.r</code><br />
[code]
library( tidyverse )<br />
library( magrittr )</p>
<p>library( ggplot2 )<br />
library( showtext )</p>
<p>library( rstan )<br />
library( tidybayes )<br />
library( loo )</p>
<p># Load UFO data<br />
ufo_population_sightings &lt;-<br />
	readRDS(&quot;work/ufo_population_sightings.rds&quot;)</p>
<p>## Simple Models<br />
## Complete Pooling &#8212; all states considered identical. </p>
<p># Fit model of UFO sightings (Normal)<br />
if( not( file.exists( &quot;work/fit_ufo_pop_normal.rds&quot; ) ) ) {</p>
<p>	message(&quot;Fitting basic Normal model.&quot;)</p>
<p>	fit_ufo_pop_normal &lt;-<br />
		stan( file=&quot;model/population_model_normal.stan&quot;,<br />
			  data=list(<br />
							observations = nrow( ufo_population_sightings ),<br />
							population = extract2( ufo_population_sightings, &quot;population&quot; ),<br />
							counts = extract2( ufo_population_sightings, &quot;count&quot; )<br />
			  )<br />
		)</p>
<p>	saveRDS( fit_ufo_pop_normal, &quot;work/fit_ufo_pop_normal.rds&quot; )</p>
<p>	message(&quot;Basic Normal model fitted.&quot;)</p>
<p>} else {</p>
<p>	fit_ufo_pop_normal &lt;- readRDS( &quot;work/fit_ufo_pop_normal.rds&quot; )</p>
<p>}</p>
<p># Fit model of UFO sightings (Poisson)<br />
if( not( file.exists( &quot;work/fit_ufo_pop_poisson.rds&quot; ) ) ) {</p>
<p>	message(&quot;Fitting basic Poisson model.&quot;)</p>
<p>	fit_ufo_pop_poisson &lt;-<br />
		stan( file=&quot;model/population_model_poisson.stan&quot;,<br />
			  data=list(<br />
							observations = nrow( ufo_population_sightings ),<br />
							population_raw = extract2( ufo_population_sightings, &quot;population&quot; ),<br />
							counts = extract2( ufo_population_sightings, &quot;count&quot; )<br />
			  )<br />
		)</p>
<p>	saveRDS( fit_ufo_pop_poisson, &quot;work/fit_ufo_pop_poisson.rds&quot; )</p>
<p>	message(&quot;Basic Poisson model fitted.&quot;)</p>
<p>} else {</p>
<p>	fit_ufo_pop_poisson &lt;- readRDS( &quot;work/fit_ufo_pop_poisson.rds&quot; )</p>
<p>}</p>
<p># Fit model of UFO sightings (Negative Binomial)<br />
if( not( file.exists( &quot;work/fit_ufo_pop_negbinom.rds&quot; ) ) ) {</p>
<p>	message(&quot;Fitting basic negative binomial model.&quot;)</p>
<p>	fit_ufo_pop_negbinom &lt;-<br />
		stan( file=&quot;model/population_model_negbinomial.stan&quot;,<br />
			  data=list(<br />
							observations = nrow( ufo_population_sightings ),<br />
							population_raw = extract2( ufo_population_sightings, &quot;population&quot; ),<br />
							counts = extract2( ufo_population_sightings, &quot;count&quot; ) )<br />
		)</p>
<p>	saveRDS( fit_ufo_pop_negbinom, &quot;work/fit_ufo_pop_negbinom.rds&quot; )</p>
<p>	message(&quot;Basic negative binomial model fitted.&quot;)</p>
<p>} else {</p>
<p>	fit_ufo_pop_negbinom &lt;- readRDS( &quot;work/fit_ufo_pop_negbinom.rds&quot; )</p>
<p>}</p>
<p>## Multilevel Models<br />
## Partial Pooling (Varying Intercept)</p>
<p>if( not( file.exists( &quot;work/fit_ufo_pop_negbinom_var_intercept.rds&quot; ) ) ) {</p>
<p>	message(&quot;Fitting varying intercept negative binomial model.&quot;)</p>
<p>	fit_ufo_pop_negbinom_var_intercept &lt;-<br />
		stan( file=&quot;model/population_model_negbinomial_var_intercept.stan&quot;,<br />
			  data=list(<br />
							observations = nrow( ufo_population_sightings ),<br />
							population_raw = extract2( ufo_population_sightings, &quot;population&quot; ),<br />
							counts = extract2( ufo_population_sightings, &quot;count&quot; ),<br />
							states = length( unique( ufo_population_sightings$state ) ),<br />
							state = as.numeric( factor( ufo_population_sightings$state ) )<br />
							),<br />
			  chains=4, iter=2000,<br />
			  control = list(max_treedepth = 15, adapt_delta=0.9)<br />
		)</p>
<p>	saveRDS( fit_ufo_pop_negbinom_var_intercept, &quot;work/fit_ufo_pop_negbinom_var_intercept.rds&quot; )</p>
<p>	message(&quot;Varying intercept negative binomial model fitted.&quot;)</p>
<p>} else {</p>
<p>	fit_ufo_pop_negbinom_var_intercept &lt;- readRDS( &quot;work/fit_ufo_pop_negbinom_var_intercept.rds&quot; )</p>
<p>}</p>
<p>### Partial Pooling (Varying intercept and slope.)</p>
<p>if( not( file.exists( &quot;work/fit_ufo_pop_negbinom_var_intercept_slope.rds&quot; ) ) ) {</p>
<p>	message(&quot;Fitting varying intercept and slope negative binomial model&quot;)</p>
<p>	fit_ufo_pop_negbinom_var_intercept_slope &lt;-<br />
		stan( file=&quot;model/population_model_negbinomial_var_intercept_slope.stan&quot;,<br />
			  data=list(<br />
							observations = nrow( ufo_population_sightings ),<br />
							population_raw = extract2( ufo_population_sightings, &quot;population&quot; ),<br />
							counts = extract2( ufo_population_sightings, &quot;count&quot; ),<br />
							states = length( unique( ufo_population_sightings$state ) ),<br />
							state = as.numeric( factor( ufo_population_sightings$state ) )<br />
							),<br />
			  chains=4, iter=2000,<br />
			  control = list(max_treedepth = 12, adapt_delta=0.8)<br />
		)</p>
<p>	saveRDS( fit_ufo_pop_negbinom_var_intercept_slope, &quot;work/fit_ufo_pop_negbinom_var_intercept_slope.rds&quot; )</p>
<p>	message(&quot;Varying intercept and slope negative binomial model fitted.&quot;)</p>
<p>} else {</p>
<p>	fit_ufo_pop_negbinom_var_intercept_slope &lt;- readRDS( &quot;work/fit_ufo_pop_negbinom_var_intercept_slope.rds&quot; )</p>
<p>}</p>
<p># Hierarchical normal. (Linear regression)<br />
if( not( file.exists( &quot;work/fit_ufo_pop_normal_var_intercept_slope.rds&quot; ) ) ) {</p>
<p>	message(&quot;Fitting varying intercept and slope normal model&quot;)</p>
<p>	fit_ufo_pop_normal_var_intercept_slope &lt;-<br />
		stan( file=&quot;model/population_model_normal_var_intercept_slope.stan&quot;,<br />
			  data=list(<br />
							observations = nrow( ufo_population_sightings ),<br />
							population_raw = extract2( ufo_population_sightings, &quot;population&quot; ),<br />
							counts = extract2( ufo_population_sightings, &quot;count&quot; ),<br />
							states = length( unique( ufo_population_sightings$state ) ),<br />
							state = as.numeric( factor( ufo_population_sightings$state ) )<br />
							),<br />
			  chains=4, iter=2000,<br />
			  control = list(max_treedepth = 15, adapt_delta=0.9)<br />
		)</p>
<p>	saveRDS( fit_ufo_pop_normal_var_intercept_slope, &quot;work/fit_ufo_pop_normal_var_intercept_slope.rds&quot; )</p>
<p>	message(&quot;Varying intercept and slope normal model fitted.&quot;)</p>
<p>} else {</p>
<p>	fit_ufo_pop_normal_var_intercept_slope &lt;- readRDS( &quot;work/fit_ufo_pop_normal_var_intercept_slope.rds&quot; )</p>
<p>}</p>
<p>## Notify by text<br />
message(&quot;All models fit.&quot;)</p>
<p># Compare models with LOO<br />
log_lik_normal &lt;- extract_log_lik(fit_ufo_pop_normal, merge_chains = FALSE)<br />
r_eff_normal &lt;- relative_eff(exp(log_lik_normal))<br />
loo_normal &lt;- loo(log_lik_normal, r_eff = r_eff_normal, cores = 2)</p>
<p>log_lik_poisson &lt;- extract_log_lik(fit_ufo_pop_poisson, merge_chains = FALSE)<br />
r_eff_poisson &lt;- relative_eff(exp(log_lik_poisson))<br />
loo_poisson &lt;- loo(log_lik_poisson, r_eff = r_eff_poisson, cores = 2)</p>
<p>log_lik_negbinom &lt;- extract_log_lik(fit_ufo_pop_negbinom, merge_chains = FALSE)<br />
r_eff_negbinom &lt;- relative_eff(exp(log_lik_negbinom))<br />
loo_negbinom &lt;- loo(log_lik_negbinom, r_eff = r_eff_negbinom, cores = 2)# Compare models with LOO</p>
<p>log_lik_negbinom_var_intercept &lt;- extract_log_lik(fit_ufo_pop_negbinom_var_intercept, merge_chains = FALSE)<br />
r_eff_negbinom_var_intercept &lt;- relative_eff(exp(log_lik_negbinom_var_intercept))<br />
loo_negbinom_var_intercept &lt;- loo(log_lik_negbinom_var_intercept, r_eff = r_eff_negbinom_var_intercept, save_psis = TRUE)</p>
<p>log_lik_negbinom_var_intercept_slope &lt;- extract_log_lik(fit_ufo_pop_negbinom_var_intercept_slope, merge_chains = FALSE)<br />
r_eff_negbinom_var_intercept_slope &lt;- relative_eff(exp(log_lik_negbinom_var_intercept_slope))<br />
loo_negbinom_var_intercept_slope &lt;- loo(log_lik_negbinom_var_intercept_slope, r_eff = r_eff_negbinom_var_intercept_slope, save_psis = TRUE)</p>
<p>normal_poisson_comparison &lt;- compare( loo_normal, loo_poisson )<br />
poiss_negbinom_comparison &lt;- compare( loo_poisson, loo_negbinom )<br />
negbinom_negbinom_var_intercept_comparison &lt;- compare( loo_negbinom, loo_negbinom_var_intercept )<br />
negbinom_var_intercept_negbinom_var_intercept_slope_comparison &lt;- compare( loo_negbinom_var_intercept, loo_negbinom_var_intercept_slope )</p>
<p>saveRDS( normal_poisson_comparison, &quot;work/normal_poisson_comparison.rds&quot; )<br />
saveRDS( poiss_negbinom_comparison, &quot;work/poiss_negbinom_comparison.rds&quot; )<br />
saveRDS( negbinom_negbinom_var_intercept_comparison, &quot;work/negbinom_negbinom_var_intercept_comparison.rds&quot; )<br />
saveRDS( negbinom_var_intercept_negbinom_var_intercept_slope_comparison, &quot;work/negbinom_var_intercept_negbinom_var_intercept_slope_comparison.rds&quot; )</p>
[/code]
</div></div>
</div>
<h2>Footnotes</h2>
]]></content:encoded>
					
					<wfw:commentRss>https://www.weirddatascience.net/2019/04/17/bayes-vs-the-invaders-part-three-the-parallax-view/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">654</post-id>	</item>
	</channel>
</rss>
