stan – Weird Data Science

Bayes vs. the Invaders (Redivivus)

moth — Sat, 15 Nov 2025 10:50:27 +0000

Straying still further from whatever dubious graces it once presumed, academia’s luciferous descent into murky realms of huddled speculation continues unabated. Relocated, reconstituted, in ever-fading cycles, the Oxford Internet Institute at the University of Oxford once again saw fit to deprive its students of the peace and comfort of banal rationality through this fourth annual Halloween Lecture.

In the absence of fresh insight, this year’s lecture revisits the chilling implications of humanity’s contact with inexplicable aerial and marine phenomena, drawing from the faintest whispers of the ante-historical record through to the shimmering echoes of statistical reasoning. Through what means do visitors from beyond the void encroach on our night-time skies? What subtle deceptions underpin their visible geometries? What values lie behind their peculiar interests in certain uncomfortably-favoured regions?

Despite all safeguards, and in the face of numerous barely-perceptible currents opposing such efforts, this event was captured, stored, and released into a world still cruelly unprepared to face its findings.

More details, and the underlying code, for these findings can be found–for those unwary enough to look–in the series of entries beginning here.

Oxford Internet Institute Halloween Lecture
Bayes vs. the Invaders (Redivivus): A Bayesian Analysis of 70 Years of UFO Sightings
Prof. Joss Wright
Oxford. October 2025

The searing heat of summer retreats, cools, fades, surrendering its vitality to the flickering uncertainties of autumn. Nights draw close, like dimly-remembered friends clustering in our dreams. The spring leaves abandon their verdant dance, as they age, wither, and drift into a russet swirl of skeletal, wind-stirred fragments.

The dying seasons return, dragging with them time-hallowed fears and uneasy rumours, pooling around these darkly dreaming spires in a mire of primal superstition. The agonizingly brittle certainties of the modern enlightenment, our desperate faith in the gossamer fabrics of scientific progress, falter in the face of primal terrors that lurk implacably in the gloom.

In these darkening days, as faith in our treasured understanding dims, it is yet again time to turn our faces fully to the darkness. Halloween, slouching inexorably towards our minds, impels us as scholars to gather our methods, our theories, our data, our knowledge; and glean what light we can from the primordial glimmers of the unknown.

Unidentifiable aerial and marine phenomena. Impossible lights in the sky. Patterns of visitation and terror. Insidious influences from the hadal voids between the stars. Who–what–swoop and glide through the ink-black nights of our world, probing and testing our structures, our societies, our minds? From barely remembered history, to early reports of impossible objects, to blurrily evidenced documentation, data concerning flying arcane observations has grown and twisted, along with our capacity to lay them bare, to subject them to analysis, and to interrogate their secrets.

In this year’s OII Halloween Lecture, we will tremblingly revisit a Bayesian analysis of seventy years of UFO sightings, drawn from a dataset collected by the National UFO Reporting Center (NUFORC). Scepticism, fear, doubt, and most accepted standards of statistical rigour, will be cast aside in our unyielding and disquieting pursuit of an uncompromised truth.

2025-bayes_vs_invaders-redivivus

Numbers of the Beast: Sasquatch Distribution Modelling

moth — Mon, 21 Jul 2025 18:09:01 +0000

Dredged from the stygian depths of last year’s waning: a cruelly belated artefact of academia’s primordial descent. For a third time, the Oxford Internet Institute at the University of Oxford—scarcely willing, but falteringly unable to reflect on its own ill-fated decisions—chose to risk the sanity and statistical intuition of its students against the numerically unstable nightmares writhing beneath the tattered fabric of our waking world.

This third OII Halloween Lecture sinks bodily into the tortured mass of data concerning cryptozoological sightings in North America. Drawing on over a century of shadow-haunted sightings documenting the curiously repellent presence of the North American Sasquatch, or Bigfoot, we aim to identify the factors associated with its presence, delineate the confounding presence of other dread manifestations, and cast our minds globally for a faded glimpse of its remote and scarcely-conceived brethren.

In what can only generously be considered an act of gross negligence, this lecture was stored, at great risk to both its author and its audience, digitally. Technology has, once again, brought us closer than we might care to dread to secrets that we were never intended to taste.

A sharper, more visceral presentation of the baroque thinking underpinning these materials will follow, in an ever-spiralling series of posts on this site. Until then, however, incautious travellers may solemnly consider the below.

OII Halloween Lecture
Sasquatch Distribution Modelling: Investigating patterns of Bigfoot sightings in North America.
Prof. Joss Wright
Oxford. October 2024

The nights grow ever closer. Streetlights flicker, shrouded in the mists, striving to pierce the gloom. The warm certainties of summer give way to the cold, dark ambiguities of autumn. Rumour, myth, and legend rise, primeval, from the shadowed recesses of our collective consciousness, undermining our faith in the fragile congruities that structure our lives.

As summer surrenders to the inexorable tread of autumn, our resolve falters against the unknown horrors residing in the tenebrous peripheries of the world. As scientists, as scholars, our duty is to cling resolutely to our methods and our ideals in the face of encroaching darkness. Our tools may seem fragile in the face of the seething irrationality of the night, but we are called to peer, however tremulously, wherever our inquiries may lead us.

Lights streak across the sky. Stories are woven of twisted faces in the darkness, half-glimpsed creatures in ancient forests, strange encounters in the wilderness. From the earliest stirrings of humanity, to the patterns of complex arcana that silently control our lives today, folklore and legend have long reported phenomena that rebel against mundane description or understanding. As our technologies evolve, and our ability to collate, scrutinise, and manipulate data spirals beyond all restraint, we are ever more capable, if not indeed obliged, to bring the lens of science to bear on these harrowing mysteries.

To embrace this dark season, you are invited to the annual Oxford Internet Institute Halloween Lecture.

This year we will pursue one of the world’s most notorious cryptozoological phenomena, investigating over a century of data regarding sightings of the Sasquatch, or Bigfoot, of North America. In which regions are these cryptic precursors of humanity most commonly observed? What factors, whether environmental or physical, create habitats most suitable for the Sasquatch to thrive, hidden from the encroaching pressures of humanity? Where, were we bold enough to look, might we seek other populations of these elusive creatures?

In this lecture we will examine some of the history surrounding sightings of Bigfoot, and related cryptids. We will impetuously apply statistical methods to derive underlying patterns from reported sightings, and heedlessly strive to uncover their meaning and implications. What can we learn from the accumulated data about the habits of cryptic species living on the fringes of our world? Is the beast, as ever, closer to us than we wish to believe?

sasquatch_distribution_modelling

Readings from the Book

moth — Sun, 28 Jan 2024 16:27:02 +0000

Once again, the Oxford Internet Institute at the University of Oxford — through madness, or through omission brought on by horrified incredulity — saw fit to expose its students to the nightmarish patterns that descend, fractal-like, endlessly below the surface of mundane reality.

This second OII Halloween Lecture drew on the twisted meanderings we travellers have taken through the cryptic verbiage of the Voynich Manuscript. We aim to establish the dread authenticity of the text, by rousing its very statistical bones from the inscrutable fasciae of its pages. Walking a tightrope between careful statistical exploration and ever-burgeoning insanity, we further explore the structures that arise from the text, separating the untranslated knowledge in the book into coherent bodies for future study.

In yet another, almost criminally negligent, oversight, the OII’s 2024 Halloween Lecture was captured, frozen in space and time, for the detriment and despair of the unexpectant world.

For those not driven to blissful negation by the tortured ramblings of the above, the underlying materials for the talk are presented, with neither hope nor tremor, here.

illuminating_the_illuminated

Whisperings in the Academy

moth — Sun, 20 Nov 2022 13:12:43 +0000

The noblest of human endeavours is to enlighten the uninitiated consciousness; to bare its awareness before the endless and terrifying vistas that lie beyond darkness and ignorance.

In pursuit of such necessarily painful revelations the Oxford Internet Institute at the University of Oxford (the unwitting host on which the investigations here parasitise) recently hosted an inaugural Halloween lecture. This oration drew on several years of dark explorations chronicled in this blog, to inculcate into a new generation of unprepared and curious minds the horror and necessity of subjecting our reality to the insidious power of statistical science. Through what seems a dangerously careless oversight, this brief glimpse of truth was recorded and made available for posterity.

For the terminally inquisitive, the archival materials on which this work was drawn are presented here.

bayes_vs_invaders

Illuminating the Illuminated – Part Four: Tempora Mutantur | Changepoint Analysis of the Voynich Manuscript

moth — Fri, 21 Feb 2020 10:31:38 +0000

Our past interrogation of the Voynich Manuscript has deconstructed its esoteric symbols into a form more suitable for our ends, subjected its statistical properties to comparison with more mundane texts, and unearthed its hidden internal structures via the esoteric process of topic modelling. In this final post, we will build on the structures revealed in earlier posts to ask how, if at all, the Voynich Manuscript’s textual properties shift within the text itself. Are there significant discontinuities in the writing, indicating a separation of the manuscript into meaningful sections? Or is the text merely a homogeneous mass more suggestive of a rote, mechanical, generative procedure?

To address this question we will delve once more into the arcana of machine learning, and draw out the technique of changepoint analysis. This procedure aims to identify one or more points in a series of observations at which the underlying process that generates the data has somehow altered.

Once more, we will operate within the warm embrace of Bayesian statistics and exploit the Stan modelling language as our means to cast light into the darkness¹.

On Shifting Sands

Changepoint analysis is an active field of endeavour, with deep subtleties in its application. For the purposes of this analysis we will focus on the comparatively simple problem of identifying a single changepoint amongst the teeming mass of the Voynich Manuscript’s strangely compelling glyphs.

Statistical analysis of changepoints has been applied to a number of historical texts. To provide a first, tantalising glimpse into a world characterised by authorship analysis and the dark arts of adversarial stylometry, we begin by reproducing the work of Riba and Ginebra in ascertaining the existence of a shift in authorship in the 15th century Catalan chivalric romance Tirant lo Blanc.

Briefly, Tirant lo Blanc was written by Joanot Martorell, a Valencian knight, whose untimely death left the manuscript unfinished. The work was completed and published by Martí Joan de Galba. The specific nature of his contributions has, however, been the subject of some debate: did he substantially compose parts of the text, or simply arrange and edit the work?

Riba and Ginebra proposed, in 2005, to identify any stylistic change in the work through a Bayesian analysis of the frequency of word lengths in the document. Their approach, built on a tradition of such analyses in the stylometry literature, relies on the observation that differing authors, even when attempting to mimic the style of another, unconsciously make different word choices. Most importantly, the relative frequency of shorter, context-free words is likely to differ between authors.

From a certain analytical perspective, each word in a sequence of text can be represented merely by its length, as in the analyses of our earlier posts. Taking this view to a probabilistic extreme, a text can be considered as a sequence of draws from a categorical distribution of word lengths. We may consider the length of each word as resulting from the roll of some abstract, biased die with a number of sides equal to the number of possible word lengths in the prose. The counts of word lengths across the full text are then a single draw from a multinomial distribution in which the number of categories matches the possible lengths of words observed in the document, and the number of trials is the number of words in the text. The stylistic differences between authors are, as noted above, expected to be revealed most strongly in the relative frequency of the shorter words.
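As a minimal sketch of this abstraction in base R, a passage can be reduced to counts over word lengths, which form one multinomial observation. The sentence and the cut-off of four letters are illustrative assumptions, not data from the analysis.

```r
# Illustrative text only; not drawn from Tirant lo Blanc or the Voynich Manuscript.
passage <- "the knight rode to the sea and saw a ship of war"

# Split on whitespace and measure each word.
words        <- strsplit( passage, "\\s+" )[[1]]
word_lengths <- nchar( words )

# Keep only the short words (here, fewer than five letters), then count
# how many words fall into each length category 1..4.
short_lengths <- word_lengths[ word_lengths < 5 ]
length_counts <- table( factor( short_lengths, levels = 1:4 ) )

# One multinomial observation: counts per category, summing to the number of trials.
print( length_counts )

# Empirical estimate of the bias of the die (the category probabilities).
theta_hat <- as.numeric( length_counts ) / sum( length_counts )
```

Each chapter or folio treated this way contributes one such row of counts to the matrix passed to the model.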

Having descended thus far, and with dark suspicions of an authorship change disrupting the contiguity of Martorell’s opus, we hypothesise that the entire volume may best be described by not one, but by two multinomial distributions over short word lengths: one for the earlier text of the original author, with a second describing the later contributions of his posthumous collaborator. The point of division between these distributions is the changepoint.

Winds of Change

To drag this concept from its abstract formulation to a tangible realisation, we turn to the Stan probabilistic programming language. The models here draw heavily on the Stan User’s Guide changepoint section, which lays out the concepts underlying these approaches with horrifying clarity.

Perhaps the most unusual aspect of implementing this model is due to Stan’s inability to sample discrete parameters; in this case the location of the changepoint. As such the model must conceal a latent discrete parameter cunningly hidden in its construction. This may then be marginalised out to reveal the probability of each value of the latent parameter. The model we construct here, therefore, will provide us not simply with a point estimate of the most likely changepoint, but a set of probabilities for each possible changepoint.

As cryptically hinted above, our model of the Voynich text abstracts its prose to counts of word lengths found on each folio of the manuscript, resulting in a multinomial distribution of word lengths. For the changepoint, we hypothesise that there are not one, but two multinomials with different parameters falling on either side of some changepoint. Our goal is to identify the pivotal folio at which this underlying shift most likely occurs.

Representing this more formally, in the style used for generative Bayesian models, we can write the distribution of word lengths, $\Omega$, in terms of its hyperparameters, with the changepoint $c$ given a uniform prior over the folios $1, \ldots, T$, as:

$$\begin{eqnarray}
\Omega &\sim& \mathbf{Multinomial}( t < c~?~\theta_e : \theta_l )\\
\theta_e &\sim& \mathbf{Dirichlet}(\alpha)\\
\theta_l &\sim& \mathbf{Dirichlet}(\alpha)\\
c &\sim& \mathbf{Uniform}(1, T)
\end{eqnarray}$$

The most crucial element for isolating the changepoint is the conditional operator in the first line. We treat the frequency of words of different lengths on a given folio as one point in a sequence, indexed by the value $t$. The conditional statement encodes that observed word lengths prior to some unknown point $t = c$ are drawn from a multinomial with one vector of parameters, $\theta_e$; from that folio onwards, word lengths are drawn according to a second multinomial with parameter vector $\theta_l$. When appropriately constructed, fitting the model produces a posterior distribution across the various possible $\theta_e$ and $\theta_l$ parameters for all possible changepoints. The folio at which the posterior probability of $c$ is highest is our best estimate of the changepoint, and is accompanied by estimates of $\theta_e$ and $\theta_l$ around that changepoint.
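Written out without the conditional operator, the likelihood for a fixed changepoint $c$ factorises into an early and a late product. This assumes, as the Stan model later in this post does, that folios are conditionally independent given the parameters. Here $\Omega_t$ denotes the vector of word-length counts on folio $t$ and $T$ the number of folios; both symbols are notation introduced for this sketch.

$$p(\Omega|c,\theta_e,\theta_l) = \prod_{t=1}^{c-1}\mathbf{Multinomial}(\Omega_t|\theta_e)\prod_{t=c}^{T}\mathbf{Multinomial}(\Omega_t|\theta_l)$$

With $c = 1$ the first product is empty, so every folio is attributed to the late parameters.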

As mentioned above, we marginalise the discrete changepoint parameter, $c$, rather than sampling it directly. A full example of the concept, applied to a Poisson distribution, is given by the Stan User’s Guide changepoint section. This key step removes the parameter $c$ from the full joint probability function of the model, resulting in a likelihood of word lengths according to the parameters $\theta_e$ and $\theta_l$, which can be calculated in Stan by summing the likelihood over all possible values of $c$².

Reproducing, with only slight adaptations, the original Poisson example, our full joint probability would be:

$$p(\theta_e, \theta_l, c, \Omega) =
p(\theta_e)p(\theta_l)p(c)p(\Omega|\theta_e,\theta_l,c)$$

Marginalising out $c$, this can be represented as:

$$\begin{eqnarray}
p(\Omega|\theta_e, \theta_l) &=& \sum_{c=1}^Tp(c,\Omega|\theta_e,\theta_l)\\
&=& \sum_{c=1}^Tp(c)p(\Omega|c,\theta_e,\theta_l)
\end{eqnarray}$$

The result is that our Stan model can be constructed by sampling across values of $\theta_e$ and $\theta_l$ for all possible values of $c$. Due to the requirement to sum across all possible values of the discrete parameter, however, this subterfuge of marginalisation is restricted in general to bounded discrete parameters.
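The marginalisation can be sketched numerically in base R, holding $\theta_e$ and $\theta_l$ fixed at known values, where Stan would instead repeat this sum at every proposed pair of parameter vectors. The toy data, the parameter values, and the changepoint at position 8 are all assumptions made purely for illustration.

```r
set.seed( 1912 )

# Hypothetical toy data: 12 "folios", 3 word-length categories, 200 words each.
num_obs <- 12
theta_e <- c( 0.6, 0.3, 0.1 )
theta_l <- c( 0.2, 0.3, 0.5 )
true_c  <- 8
y <- t( sapply( 1:num_obs, function( folio )
   rmultinom( 1, size = 200, prob = if( folio < true_c ) theta_e else theta_l )[, 1] ) )

# Numerically stable log( sum( exp( x ) ) ).
log_sum_exp <- function( x ) max( x ) + log( sum( exp( x - max( x ) ) ) )

# log p(c) + log p(y | c, theta_e, theta_l) for every candidate changepoint c.
log_joint <- sapply( 1:num_obs, function( c_val ) {
   log_lik <- sapply( 1:num_obs, function( folio )
      dmultinom( y[folio, ], prob = if( folio < c_val ) theta_e else theta_l, log = TRUE ) )
   -log( num_obs ) + sum( log_lik )
} )

# Marginal log likelihood with c summed out.
log_marginal <- log_sum_exp( log_joint )

# Posterior probability of each changepoint, conditional on these thetas.
changepoint_prob <- exp( log_joint - log_marginal )
```

The vector `changepoint_prob` is the quantity that marginalisation lets us recover without sampling $c$: one probability for each candidate position.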

We can distill the above into a Stan model as given below.

Multinomial changepoint model

multinomial_changepoint.stan
[code language="c"]

data {

  int num_obs;   // Number of observations (rows/pages) in data.
  int num_cats;  // Number of categories in data.

  // Matrix of observations.
  // Stan 2.33 and later require: array[num_obs, num_cats] int y;
  int y[num_obs, num_cats];

  vector[num_cats] alpha; // Dirichlet prior values.

}

transformed data {

  // Uniform prior across all time points for changepoint.
  real log_unif;
  log_unif = -log(num_obs);

}

parameters {

  // Two sets of parameters.
  // One (early) before changepoint, one (late) for after.
  simplex[num_cats] theta_e;
  simplex[num_cats] theta_l;

}

transformed parameters {

  // // This code shows a slower, but easier to understand updating of log posterior via summation.
  // vector[num_obs] lp;
  // lp = rep_vector(log_unif, num_obs);
  // for (s in 1:num_obs)
  //   for (t in 1:num_obs)
  //     lp[s] = lp[s] + multinomial_lpmf(y[t,] | t < s ? theta_e : theta_l);

  // This approach relies on dynamic programming to reduce runtime from quadratic to linear in num_obs.
  // See
  vector[num_obs] log_p;
  {
    // Running totals of the log likelihood under each parameter vector;
    // element i + 1 holds the sum over observations 1..i.
    vector[num_obs + 1] log_p_e;
    vector[num_obs + 1] log_p_l;

    log_p_e[1] = 0;
    log_p_l[1] = 0;

    for( i in 1:num_obs ) {
      log_p_e[i + 1] = log_p_e[i] + multinomial_lpmf(y[i,] | theta_e );
      log_p_l[i + 1] = log_p_l[i] + multinomial_lpmf(y[i,] | theta_l );
    }

    // For changepoint s: early terms before s, late terms from s onwards.
    log_p =
      rep_vector( -log(num_obs) + log_p_l[num_obs + 1], num_obs) +
      head(log_p_e, num_obs) - head(log_p_l, num_obs);
  }
}

model {

  // Priors
  theta_e ~ dirichlet( alpha );
  theta_l ~ dirichlet( alpha );

  // Marginalise over the changepoint location.
  target += log_sum_exp( log_p );

}

generated quantities {

  simplex[num_obs] changepoint_simplex; // Simplex of locations for changepoint.

  // Convert the log posterior to a simplex.
  changepoint_simplex = softmax( log_p );

}

[/code]
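To see that the dynamic-programming form in the `transformed parameters` block matches the slower commented-out double loop, the following base R sketch computes both on made-up per-observation log likelihoods. The values of `ll_e` and `ll_l` are random placeholders standing in for the `multinomial_lpmf()` terms, not output from the model.

```r
set.seed( 1912 )

# Hypothetical per-observation log likelihoods under the early and late parameters.
num_obs <- 10
ll_e <- -runif( num_obs, 1, 5 )
ll_l <- -runif( num_obs, 1, 5 )

# Quadratic version: for each candidate s, observations before s use the early
# parameters and observations from s onwards use the late parameters.
lp_slow <- sapply( 1:num_obs, function( s ) {
   early <- seq_len( s - 1 )
   late  <- s:num_obs
   -log( num_obs ) + sum( ll_e[early] ) + sum( ll_l[late] )
} )

# Linear version: running totals, with a leading zero as in log_p_e[1] = 0.
cum_e <- c( 0, cumsum( ll_e ) )
cum_l <- c( 0, cumsum( ll_l ) )
lp_fast <- -log( num_obs ) + cum_l[num_obs + 1] +
   head( cum_e, num_obs ) - head( cum_l, num_obs )

# The two agree up to floating-point error.
print( all.equal( lp_slow, lp_fast ) )

# softmax( log_p ), as used for changepoint_simplex in generated quantities.
changepoint_simplex <- exp( lp_fast - max( lp_fast ) )
changepoint_simplex <- changepoint_simplex / sum( changepoint_simplex )
```

The saving comes from computing each running total once and reusing it for every candidate changepoint, rather than re-summing the likelihood terms for each.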

To launch the model in gentler waters, we apply it to Martorell and de Galba’s Tirant lo Blanc to discover where, if at all, de Galba’s major contributions to the text begin, and see the extent to which our Stan rendering reproduces the results of Riba and Ginebra’s analysis. As in their work we will focus only on those words with fewer than five letters, where the key stylistic difference between authors makes itself apparent.

We place a relatively uninformative Dirichlet prior on the multinomial $\theta_e$ and $\theta_l$ parameters, with the vector of $\alpha$ values set to one. This results in a uniform distribution across all possible simplexes. Note that this does not push the multinomial towards a uniform simplex; rather, it makes all possible simplexes equally likely. Values of $\alpha > 1$ produce increasingly uniform simplexes; values of $0 < \alpha < 1$ produce simplexes in which the probability mass is more likely to be concentrated in a small number of elements.
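The effect of $\alpha$ can be glimpsed by simulation. The sketch below, in base R, draws from a symmetric Dirichlet by normalising gamma variates (a standard construction) and reports the average size of the largest component as a rough measure of concentration. The helper names `rdirichlet_sym` and `mean_max`, and the three $\alpha$ values, are introduced here for illustration only.

```r
set.seed( 1912 )

# Draw n samples from a symmetric Dirichlet by normalising gamma variates.
rdirichlet_sym <- function( n, num_cats, alpha ) {
   g <- matrix( rgamma( n * num_cats, shape = alpha ), nrow = n )
   g / rowSums( g )
}

# Average size of the largest component across 5000 sampled simplexes.
mean_max <- function( alpha )
   mean( apply( rdirichlet_sym( 5000, 4, alpha ), 1, max ) )

concentration <- sapply( c( 0.6, 1, 5 ), mean_max )
names( concentration ) <- c( "alpha=0.6", "alpha=1", "alpha=5" )
print( round( concentration, 2 ) )
```

Smaller values of $\alpha$ should yield a larger average maximum, reflecting mass gathered in fewer categories; larger values pull the simplexes towards uniformity.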

With the resulting data and model in unison, we can see the results of this analytic process.

Tirant lo Blanc Changepoint | (PDF Version)

Our multinomial automaton suggests a location for a significant stylistic changepoint in Tirant lo Blanc with the main concentration of probability mass around chapter 374. Perhaps unsurprisingly, but pleasingly, this is in close agreement with the earlier analysis of Riba and Ginebra, who placed their estimates between chapters 371 and 382.
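One way to summarise such a posterior, sketched here in base R on invented probabilities rather than the fitted values, is to report the most probable position together with the smallest set of positions holding a chosen share of the probability mass. The 80% threshold and the vector `prob` are assumptions for illustration.

```r
# Hypothetical changepoint probabilities over 10 positions; in the analysis
# these would be the posterior means of changepoint_simplex.
prob <- c( 0.01, 0.01, 0.02, 0.05, 0.30, 0.40, 0.15, 0.03, 0.02, 0.01 )

# Most probable changepoint.
mode_position <- which.max( prob )

# Smallest set of positions holding at least 80% of the probability mass:
# take positions in decreasing order of probability until the total reaches 0.8.
ranked       <- order( prob, decreasing = TRUE )
num_needed   <- which( cumsum( prob[ranked] ) >= 0.8 )[1]
credible_set <- sort( ranked[ seq_len( num_needed ) ] )
```

Reporting a set of this kind, rather than a single chapter, makes for a fairer comparison with an interval estimate such as that of Riba and Ginebra.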

Tirant lo Blanc analysis code

[code language="r"]

library( tidyverse )
library( tidyselect )
library( magrittr )

library( rstan )

library( tidytext )

# Load Tirant data
message( "Reading raw Tirant data…" )
tirant_tbl <-
read_csv( "data/tirant_raw.csv", col_names=FALSE ) %>%
rename( chapter = X1, text = X2 ) %>%
mutate( page = as.numeric( rownames(.) ) )

# Tokenize
tirant_words <-
tirant_tbl %>%
unnest_tokens( word, text )

# Pivot the data wider to be presented to Stan as a matrix of multinomial samples.
tirant_lengths <-
tirant_words %>%
mutate( word_length = str_length( word ) ) %>%
mutate( word_length = ifelse( word_length > 9, 10, word_length )) %>%
group_by( page, word_length ) %>%
summarise( count = n( )) %>%
pivot_wider( names_from = word_length, values_from = count ) %>%
ungroup %>%
select( -c(page,"5","6","7","8","9","10") ) %>%
select(sort(peek_vars())) %>%
replace( is.na(.), 0 )

if( not( file.exists( "work/multinomial_changepoint_tirant_fit.rds" ) ) ) {

message( "Fitting multinomial model.")
tirant_multinom_fit <-
stan( "multinomial_changepoint.stan",
data=list(
num_obs=487,
num_cats=4,
y = as.matrix( tirant_lengths ),
alpha = rep( 1, 4 ) ),
iter=16000,
control=list(
adapt_delta=0.98,
max_treedepth=15 ) )

saveRDS( tirant_multinom_fit, "work/multinomial_changepoint_tirant_fit.rds" )

} else {
message( "Loading saved multinomial model.")
tirant_multinom_fit <- readRDS( "work/multinomial_changepoint_tirant_fit.rds" )
}

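# Posterior mean probability of a changepoint at each chapter
# ('changepoint_simplex'). rstan::extract is named explicitly, as tidyr and
# magrittr also export a function called extract().
mean_changepoint_prob <-
rstan::extract( tirant_multinom_fit )$changepoint_simplex %>%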
as_tibble( .name_repair="unique" ) %>%
summarise_all( mean ) %>%
pivot_longer( everything() ) %>%
rowid_to_column()

# Save values for plotting
saveRDS( mean_changepoint_prob, file="work/mean_changepoint_prob_tirant.rds" )

[/code]

Tirant lo Blanc data file

Tirant lo Blanc formatted raw text data.

Tirant lo Blanc changepoint plot code

multinomial_changepoint_tirant_plot.r
[code language="r"]

library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( cowplot ) # Provides save_plot(), used below.
library( grimoire ) #

# Fonts
font_add( "main_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf")
font_add( "bold_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf")

showtext_auto()

mean_changepoint_prob <-
readRDS( "work/mean_changepoint_prob_tirant.rds" )

changepoint_plot <-
ggplot( mean_changepoint_prob ) +
geom_col( aes( x=rowid, y=value ), fill=weird_colours["blood"] ) +
labs( x="Chapter", y="Probability of Changepoint" ) +
theme(
panel.background = element_rect(fill = "transparent", colour = "transparent"),
plot.background = element_rect(fill = "transparent", colour = "transparent"),
plot.title = element_text( family="bold_font", colour=weird_colours["ink"], size=22 ),
plot.subtitle = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.text = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.title.x = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.title.y = element_text( family="bold_font", angle=90, colour=weird_colours["ink"], size=12 ),
axis.line = element_line( colour=weird_colours["ink"] ),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank()
)

# grimoire::decorate_plot() from
parchment_plot <-
decorate_plot(
title="Tirant lo Blanc Chapter Changepoint Probability",
subtitle="http://www.weirddatascience.net | @WeirdDataSci",
plot=changepoint_plot,
bg_image="resources/img/parchment.jpg",
footer="Data: http://einesdellengua.com/tirantloweb/tirantloblanch.html" )

save_plot("output/multinomial_changepoint_tirant_plot.pdf",
parchment_plot,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

Nos et mutamur in illis

The changepoint model realises the dark imaginings of our predecessors. What horrors might it reveal when applied to the Voynich Manuscript? Following the logic of the above analysis, we will focus initially on the shorter words in the Voynich corpus.

To improve model fit we will place a little more information in the prior, setting the $\alpha$ values for the Dirichlet prior to 0.6 to push the multinomial towards more concentrated probabilities. This reflects the gamma distribution of word length frequencies discussed in our earlier analysis. We also combine all single- and two-letter Voynich terms into a single category due to the small number of words falling into these categories³.

The model can now reveal where, if at all, a likely fracture resides in the textual assemblage of the Voynich Manuscript.

Voynich Manuscript Changepoint | (PDF Version)

Voynich Manuscript word length analysis code

multinomial_changepoint_voynich.r
[code language="r"]

library( tidyverse )
library( tidyselect )
library( magrittr )

library( rstan )

library( tidytext )

# Load Voynich data
message( "Reading raw Voynich data…" )
voynich_tbl <-
read_csv( "data/voynich_raw.csv", col_names=FALSE ) %>%
rename( folio = X1, text = X2 )

# Tokenize
voynich_words <-
voynich_tbl %>%
unnest_tokens( word, text )

# Calculate the lengths of words
voynich_pure_lengths <-
voynich_words %>%
transmute( word_length = str_length( word ) )

voynich_pure_lengths$count <- 1

# Pivot the data wider to be presented to Stan as a matrix of multinomial samples.
voynich_lengths <-
voynich_words %>%
mutate( word_length = str_length( word ) ) %>%
mutate( word_length = ifelse( word_length > 8, 9, word_length )) %>%
mutate( word_length = ifelse( word_length < 2, 2, word_length )) %>%
group_by( folio, word_length ) %>%
summarise( count = n( )) %>%
pivot_wider( names_from = word_length, values_from = count ) %>%
ungroup %>%
select( -c("folio", "5", "6", "7", "8", "9" )) %>%
select(sort(peek_vars())) %>%
replace( is.na(.), 0 )

if( not( file.exists( "work/multinomial_changepoint_voynich_fit.rds" ) ) ) {

message( "Fitting multinomial model.")
voynich_seed <- 1912
num_cats <- ncol( voynich_lengths )
voynich_multinomial_fit <-
stan( "multinomial_changepoint.stan",
data=list(
num_obs=226,
num_cats=num_cats,
y = as.matrix( voynich_lengths ),
alpha = rep( 0.6, num_cats ) ),
chains=4,
iter=8000, seed=voynich_seed,
control = list( adapt_delta=0.99,
max_treedepth=12 ) )

saveRDS( voynich_multinomial_fit, "work/multinomial_changepoint_voynich_fit.rds" )

} else {
message( "Loading saved multinomial model.")
voynich_multinomial_fit <- readRDS( "work/multinomial_changepoint_voynich_fit.rds" )
}

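# Posterior mean probability of a changepoint at each folio
# ('changepoint_simplex'). rstan::extract is named explicitly, as tidyr and
# magrittr also export a function called extract().
mean_changepoint_prob <-
rstan::extract( voynich_multinomial_fit )$changepoint_simplex %>%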
as_tibble( .name_repair="unique" ) %>%
summarise_all( mean ) %>%
pivot_longer( everything() ) %>%
rowid_to_column()

# Posterior mean of the unnormalised log probability of each changepoint
# ('log_p').
mean_log_p <-
rstan::extract( voynich_multinomial_fit )$log_p %>%
as_tibble( .name_repair="unique" ) %>%
summarise_all( mean ) %>%
pivot_longer( everything() ) %>%
rowid_to_column()

# Save mean values for plotting
saveRDS( mean_changepoint_prob, file="work/mean_changepoint_prob_voynich.rds" )

[/code]

Voynich Manuscript data file

Voynich Manuscript formatted raw text data.

Voynich Manuscript changepoint plot code

multinomial_changepoint_voynich_plot.r
[code language="r"]

library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( cowplot )
library( grimoire ) # https://github.com/weirddatascience/grimoire

# Fonts
font_add( "voynich_font", "resources/fonts/voynich/eva1.ttf")
font_add( "main_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf")
font_add( "bold_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf")

showtext_auto()

mean_changepoint_prob <-
readRDS( "work/mean_changepoint_prob_voynich.rds" )

changepoint_plot <-
ggplot( mean_changepoint_prob ) +
geom_col( aes( x=rowid, y=value ), fill=weird_colours["blood"] ) +
labs( x="Folio", y="Probability of Changepoint" ) +
theme(
panel.background = element_rect(fill = "transparent", colour = "transparent"),
plot.background = element_rect(fill = "transparent", colour = "transparent"),
plot.title = element_text( family="bold_font", colour=weird_colours["ink"], size=22 ),
plot.subtitle = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.text = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.title.x = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.title.y = element_text( family="bold_font", colour=weird_colours["ink"], angle=90, size=12 ),
axis.line = element_line( colour=weird_colours["ink"] ),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank()
)

# grimoire::decorate_plot() from
parchment_plot <-
decorate_plot(
title="Voynich Folio Word-Length Changepoint",
subtitle="http://www.weirddatascience.net | @WeirdDataSci",
plot=changepoint_plot,
bg_image="resources/img/parchment.jpg",
footer="Data: http://www.voynich.nu" )

save_plot("output/multinomial_changepoint_voynich_plot.pdf",
parchment_plot,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

Our model appears to identify a potential changepoint in the frequency of short words in the Voynich Manuscript somewhere around Folio 33.

In contrast to the analysis of Tirant lo Blanc, we have no a priori suspicion that multiple authors were involved in the creation of the Voynich Manuscript. As such, we may hypothesise this changepoint as reflecting a simple stylistic shift, a shift in content, or, as with the previous analysis, a shift in authorship.

Mutatis Mutandis

Recalling the topic model from the previous post, and the manually-assigned topics hinted at by the diagrams in the manuscript, Folio 33 might not immediately arouse our suspicions as the most obvious candidate for such a shift in style, falling as it does some way through the manually-identified herbal section, and without an immediately apparent shift in the distribution of topics.

For simplicity of presentation and analysis we will work with the alternative 12-topic model suggested by the metrics in the previous post rather than the 34-topic model initially given there⁴. The distribution of topics in this model can be presented as those of the 34-topic model were in our previous post.

Voynich Manuscript Folio Topic Heatmap | (PDF Version)

We might, then, ask whether the distribution of topics from this model itself has a changepoint. Having interrogated the distribution of word length frequencies, we can pose the same questions to the distribution of assignments produced by the topic model. Similarly to above: were we to conceive of the topic model assignment for each page as being the result of the roll of some biased die, is there a notable point in the document where the bias of that die seems to shift?

Framed as such, there is little more required to apply this model to the topic assignments. Our Stan model, dredged from its slumber, merely needs to be provided with the topic model folio assignment data.
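The biased-die framing can be sketched in a few lines of R. This is not the Stan model used for the analysis: as a lighter illustration it swaps in a conjugate Dirichlet prior, so that the posterior over a single changepoint location can be computed in closed form. The function names and the toy sequence `z` are hypothetical, chosen only for this example.

```r
# Log marginal likelihood of an ordered sequence of categorical draws,
# summarised by its category counts, under a Dirichlet(alpha) prior.
log_marginal <- function(counts, alpha) {
  lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts)) +
    sum(lgamma(alpha + counts) - lgamma(alpha))
}

# Posterior over a single changepoint location, uniform prior over locations.
# z is a vector of category labels (1..num_cats), one per page.
changepoint_posterior <- function(z, num_cats, alpha = rep(1, num_cats)) {
  n <- length(z)
  # Candidate s: observations 1..s come from one die, (s+1)..n from another.
  log_post <- sapply(seq_len(n - 1), function(s) {
    before <- tabulate(z[1:s], nbins = num_cats)
    after  <- tabulate(z[(s + 1):n], nbins = num_cats)
    log_marginal(before, alpha) + log_marginal(after, alpha)
  })
  # Normalise on the log scale for numerical stability.
  post <- exp(log_post - max(log_post))
  post / sum(post)
}

# Toy sequence (invented): the die shows only 1 for 20 rolls, then only 2.
z <- c(rep(1, 20), rep(2, 20))
post <- changepoint_posterior(z, num_cats = 2)
which.max(post)
```

For real topic assignments the probability mass would be spread over neighbouring pages rather than concentrated on one, much as in the plots in this post.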

Voynich Manuscript topic model changepoint | (PDF Version)

Voynich Manuscript topic model changepoint code

multinomial_changepoint_voynich_topics.r
[code language="r"]

library( tidyverse )
library( tidyselect )
library( magrittr )

library( rstan )
library( tidytext )

library( gtools ) # Specifically for `mixedsort`, to sort column names numerically.

# Number of topics in model
num_topics <- 12

# Load Voynich topic model data
message( "Reading Voynich topic model data…" )
voynich_tbl <-
readRDS( paste0( "work/topic_identity-", num_topics, ".rds" )) %>%
select( -c( "gamma", "section" ) ) %>%
ungroup

# Pivot the data wider to be presented to Stan as a matrix of samples from a multinomial.
voynich_lengths <-
voynich_tbl %>%
mutate( count=1 ) %>%
pivot_wider( names_from = topic, values_from = "count" ) %>%
select( -document ) %>%
select(mixedsort(peek_vars())) %>%
replace( is.na(.), 0 )

topic_fit_file <- paste0( "work/multinomial_changepoint_voynich_topic_fit-", num_topics, ".rds" )
if( not( file.exists( topic_fit_file ) ) ) {

message( "Fitting multinomial model.")
voynich_multinom_fit <-
stan( "multinomial_changepoint.stan",
data=list(
num_obs=226,
num_cats=num_topics,
y = as.matrix( voynich_lengths ),
alpha = rep( 1, num_topics ) ),
iter=8000,
seed=19300319,
control=list( adapt_delta=0.9 ) )
saveRDS( voynich_multinom_fit, topic_fit_file )

} else {
message( "Loading saved multinomial model.")
voynich_multinom_fit <- readRDS( topic_fit_file )
}

# Extract the posterior changepoint simplex and average over draws.
# (rstan::extract is named explicitly, as tidyr and magrittr also export `extract`.)
mean_changepoint_prob <-
rstan::extract( voynich_multinom_fit )$changepoint_simplex %>%
as_tibble( .name_repair="unique" ) %>%
summarise_all( mean ) %>%
pivot_longer( everything() ) %>%
rowid_to_column()

# Save values for plotting
saveRDS( mean_changepoint_prob, file=paste0( "work/mean_changepoint_prob_voynich_topic-", num_topics, ".rds" ) )

[/code]

Voynich Manuscript topic model changepoint plot code

multinomial_changepoint_voynich_topics_plot.r
[code language="r"]

library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( cowplot )
library( grimoire ) # https://github.com/weirddatascience/grimoire

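# Fonts (same files as in the word-length plot script)
font_add( "main_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf")
font_add( "bold_font", "resources/fonts/alchemy/1651 Alchemy/1651AlchemyNormal.otf")

showtext_auto()

# Specify the number of topics in the model
num_topics <- 12

mean_changepoint_prob <-
readRDS( paste0("work/mean_changepoint_prob_voynich_topic-", num_topics, ".rds" ) )

# The listing as published omits the definition of changepoint_plot used below.
# This definition is an assumption, mirroring the word-length plot script.
changepoint_plot <-
ggplot( mean_changepoint_prob ) +
geom_col( aes( x=rowid, y=value ), fill=weird_colours["blood"] ) +
labs( x="Folio", y="Probability of Changepoint" ) +
theme(
panel.background = element_rect(fill = "transparent", colour = "transparent"),
plot.background = element_rect(fill = "transparent", colour = "transparent"),
axis.text = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.title = element_text( family="bold_font", colour=weird_colours["ink"], size=12 ),
axis.line = element_line( colour=weird_colours["ink"] ),
panel.grid = element_blank()
)

# decorate_plot() is provided by the grimoire package loaded above.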
parchment_plot <-
decorate_plot(
title=paste0( "Voynich Folio Changepoint Probability – Topic Model (", num_topics, " topics)"),
subtitle="http://www.weirddatascience.net | @WeirdDataSci",
plot=changepoint_plot,
bg_image="resources/img/parchment.jpg",
footer="Data: http://www.voynich.nu",
rel_heights=c(0.1, 1, 0.05 ))

save_plot( paste0("output/multinomial_changepoint_voynich_topic-", num_topics, "_plot.pdf" ),
parchment_plot,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

The location suggested for the topic model changepoint is surprisingly close to that of the changepoint for the frequency counts of short words in the Voynich Manuscript. Whilst the changepoint probabilities are somewhat more diffuse in the topic model analysis, the greatest probability mass is centred around Folio 38, with much lower spikes extending out as far as Folio 55 and Folio 30.

In addition to the mutual support that these two analyses provide, it is notable that the major changepoint identified by both falls squarely in the earlier portion of the first, major “herbal” section identified manually by Voynich scholars through inspection of the images accompanying the text. This suggests that the first section, at least in terms of textual content, is not as homogeneous as has previously been suggested. Future scholars investigating the structure of the Voynich Manuscript may therefore wish to direct more attention towards the early-to-middle part of the herbal section, around folios 30 to 40, to identify what dreadful changes may emerge at that point in the text.

We might naturally, but do not here, extend this analysis by shattering further the smooth unity of the Manuscript according to multiple changepoints. Such an extension is, as intimated in the guide, conceptually simple but computationally burdensome, as it requires recalculation of multiple potential distributions across an ever increasing number of parameters. As such, we leave this analysis, and the means to conduct it more efficiently, for the dim future.

Omnia mutantur, nihil interit

Our analysis has provided us with an abstracted location in the text at which our unsettling suspicions of change lie. The two folios arousing greatest curiosity are therefore Folio 33 and Folio 38, which we present here to reify our horror. It is worth highlighting, however, that the changepoint analysis says nothing specific about these two folios; the model identifies that the inexplicable scrawling prior to the changepoint differs significantly from the maddeningly incomprehensible glyphs following it, nothing more.

Voynich Folio 33

Voynich Folio 38

The statistical properties that we have uncovered in the Voynich Manuscript over the past four posts reveal something of its inner structure. They support, but cannot prove, the view that the Manuscript is not a hoax, and that the text is most likely drawn from some natural language.

The changepoint analyses in this post are a powerful tool for identifying evolution and mutation in data, and the demonstrated application of stylometric analysis to Martorell’s Tirant lo Blanc supports their use in revealing points of fracture in texts, without reference to the source language.

With specific relevance to the Voynich Manuscript, both the word frequency changepoint and the topic model changepoint suggest that the manuscript’s contents shift significantly at some point around Folios 30 to 40. Given the previous assignment of topics based on manual identification of images accompanying the text, this presents a new avenue of investigation for Voynich researchers.

We have resisted the tantalising draw of attempts to translate the Voynich Manuscript. The tools we have applied are more broadly statistical and aim at unveiling structures and revealing patterns in the text; whilst they may provide information towards deciphering the text, that particular conundrum is for the future.

There are, as we would always wish, many avenues left unexplored in this particular labyrinth. The topic model is crude, and more subtle disassemblies could well provide a more refined view. The word frequency patterns support natural language, but we have not made any effort to correlate them with known languages. We have treated words as a unit of analysis, but have not looked in detail at the structure of likely prefixes and suffixes; similar words with differing endings are particularly notable in the topic model, and analysis of these could reveal much more than we have dared to attempt.

There could be much to learn from assessing multiple changepoints in the Manuscript. The presentation of the volume certainly supports its composition of multiple disparate sections; perhaps identifying an inexorable sequence of stylistic shifts could unveil still more of this structure.

For now, however, our meandering journey into the dim twilight of the Voynich Manuscript has drawn to a close, leaving us still searching for illumination in the shadows of this most elegant enigma.

Continue to search, in fear of what you may find.

Code and data for this post: https://github.com/weirddatascience/weirddatascience/tree/master/20200220-voynich04-tempora_mutantur.

Footnotes

Bayes vs. the Invaders! Part Four: Convergence

moth — Sun, 28 Apr 2019 08:38:43 +0000

Sealed and Buried

In the previous three posts⁵ in our series delving into the cosmic horror of UFO sightings in the US, we have descended from the deceptively warm and sunlit waters of basic linear regression, through the increasingly frigid, stygian depths of Bayesian inference, generalised linear models, and the probabilistic programming language Stan.

In this final post we will explore the implications of the murky realms in which we find ourselves, and consider the awful choices that have led us to this point. We will therefore look, with merciful brevity, at the foul truth revealed by our models, but also consider the arcane philosophies that lie sleeping beneath.

Deviant Interpretations

Our crazed wanderings through dark statistical realms have led us eventually to a varying slope, varying intercept negative binomial generalised linear model, whose selection was justified over its simpler cousins via leave-one-out cross-validation (LOO-CV). By interrogating the range of hyperparameters of this model, we could reproduce an alluringly satisfying visual display of the posterior predictive distribution across the United States:

Varying intercept and slope negative binomial GLM of UFO sightings against population. (PDF Version)

Further, our model provides us with insight into the individual per-state intercept $\alpha$ and slope $\beta$ parameters of the underlying linear model, demonstrating that there is variation between the rate of sightings in US states that cannot be accounted for by their ostensibly human population.

Varying slope and intercept negative binomial GLM parameter plot for UFO sightings model. (PDF Version)

Interpreting these parameters, however, is not quite as simple as in a basic linear model⁶. Most importantly, our negative binomial GLM employs a log link function to relate the linear model to the data:

$$\begin{eqnarray}
y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &=& \alpha + \beta x\\
\alpha &\sim& \mathcal{N}(0, 1)\\
\beta &\sim& \mathcal{N}(0, 1)\\
\phi &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$

In a basic linear regression, $y=\alpha+\beta x$, the $\alpha$ parameter can be interpreted as the value of $y$ when $x$ is 0. Increasing the value of $x$ by 1 results in a change in the $y$ value of $\beta$. We have, however, been drawn far beyond such naive certainties.

The $\alpha$ and $\beta$ coefficients in our negative binomial GLM do not produce the $y$ value directly; they produce the $\log$ of $\mu$, the mean of the negative binomial in our parameterisation.

With a simple rearrangement, we can begin to understand the grim effects of this transformation:

$$\begin{array}{rrcl}
& \log(\mu) &=& \alpha + \beta x\\
\Rightarrow & \mu &=& \operatorname{e}^{\alpha + \beta x}
\end{array}$$

If we set $x=0$:

$$\begin{eqnarray}
\mu_0 &=& \operatorname{e}^{\alpha}
\end{eqnarray}$$

The mean of the negative binomial when $x$ is 0 is therefore $\operatorname{e}^{\alpha}$. If we increase the value of $x$ by 1:

$$\begin{eqnarray}
\mu_1 &=& \operatorname{e}^{\alpha + \beta}\\
&=& \operatorname{e}^{\alpha} \operatorname{e}^{\beta}
\end{eqnarray}$$

Which, if we recall the definition of the underlying mean of our model’s negative binomial, $\mu_0$, above, is:
$$\mu_0 \operatorname{e}^{\beta}$$

The effect of an increase in $x$ is therefore multiplicative with a log link: each increase of $x$ by 1 causes the mean of the negative binomial to be further multiplied by $\operatorname{e}^{\beta}$.
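A small numerical check can make this multiplicative effect concrete. The sketch below uses arbitrary, purely illustrative values for $\alpha$ and $\beta$ (they are not estimates from our fitted model) and confirms that each unit step in $x$ scales the mean by $\operatorname{e}^{\beta}$.

```r
# Illustrative values only: not estimates from the fitted UFO model.
alpha <- 1.5
beta  <- 0.3

# Mean of the negative binomial under a log link.
mu <- function(x) exp(alpha + beta * x)

x <- 0:4
means <- mu(x)

# Ratio of successive means: constant, and equal to exp(beta).
ratios <- means[-1] / means[-length(means)]

print(round(means, 3))
print(ratios)
print(exp(beta))
```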

Despite this insidious complexity, in many senses our naive interpretation of these values still holds true. A higher value for the $\beta$ coefficient does mean that the rate of sightings increases more swiftly with population.

With the full, unsettling panoply of US States laid out before us, any attempt to elucidate their many and varied deviations would be overwhelming. Broadly, we can see that both slopes and intercepts are generally restricted to a fairly close range, with the 50% and 95% credible intervals notably overlapping in many cases. Despite this, there are certain unavoidable abnormalities from which we cannot, must not, shrink:

Only Pennsylvania presents a slope ($\beta$) parameter that could be considered as potentially zero, if we consider its 95% credible interval. The correlation between population and number of sightings is otherwise unambiguously positive.
Delaware, whilst presenting a wide credible interval for its slope ($\beta$) parameter, stands out as suffering from the greatest rate of change in sightings as its population increases.
Both California and Utah present suspiciously narrow credible intervals on their slope ($\beta$) parameters. The growth in sightings as the population increases therefore demonstrates a worrying consistency although, in both cases, this rate is amongst the lowest of all the states.

We can conclude, then, that while the total number of sightings in Delaware is currently low, any increase in numbers of residents there appears to possess a strange fascination for visitors from beyond the inky blackness of space. By contrast, whilst our alien observers have devoted significant resources to monitoring Utah and California, their apparent willingness to devote further effort to tracking those states’ burgeoning populations is low.

Trembling Uncertainty

One of the fundamental elements of the Bayesian approach is its willing embrace of uncertainty. The outputs of our eldritch inferential processes are not point estimates of the outcome, as in certain other approaches, but instead posterior predictive distributions for those outcomes. As such, when we turn our minds to predicting new outcomes based on previously unseen data, our prediction is a distribution over possible values rather than a single estimate. Thus, at the dark heart of Bayesian inference is the belief that all uncertainty be quantified as probability distributions.
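To make the idea of a predictive distribution concrete, the following sketch simulates one in R. The posterior draws here are invented stand-ins generated with `rnorm()` (in practice they would be extracted from the fitted Stan model), and `x_new` is a hypothetical standardised population value. Each draw of the parameters yields one simulated count, so the prediction is a whole distribution rather than a point.

```r
set.seed(1947)

# Stand-in posterior draws; assumed values for illustration only.
n_draws <- 4000
alpha_draws <- rnorm(n_draws, mean = 4.0, sd = 0.10)
beta_draws  <- rnorm(n_draws, mean = 0.5, sd = 0.05)
phi_draws   <- abs(rnorm(n_draws, mean = 2.0, sd = 0.20))

# Hypothetical new (centred and scaled) population value.
x_new <- 1.0

# One simulated count per posterior draw. R's `size` argument plays the role
# of the dispersion phi in Stan's neg_binomial_2 parameterisation.
mu_new <- exp(alpha_draws + beta_draws * x_new)
y_pred <- rnbinom(n_draws, mu = mu_new, size = phi_draws)

# Summarise the predictive distribution with an interval, not a single number.
quantile(y_pred, probs = c(0.025, 0.5, 0.975))
```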

The Bayesian approach as inculcated here has a predictive bent to it. These intricate methods lend themselves to forecasting a distribution of possibilities before the future unveils itself. Here, we gain a horrifying glimpse into the emerging occurrence of alien visitations to the US as its people busy themselves about their various concerns, scrutinised and studied, perhaps almost as narrowly as a man with a microscope might scrutinise the transient creatures that swarm and multiply in a drop of water.

Unavoidable Choices

The twisted purpose underlying this series of posts has been not only to indoctrinate others into the hideous formalities of Bayesian inference, probabilistic programming, and the arcane subtleties of the Stan programming language, but also to expose our own minds to their horrors. As such, there is a tentative method to the madness of some of the choices made in this series that we will now elucidate.

Perhaps the most jarring choice has been to code these models in Stan directly, rather than using one of the excellent helper libraries that allow for more concise generation of the underlying Stan code. Both brms and rstanarm possess the capacity to spawn models such as ours with greater simplicity of specification and efficiency of output, due to a number of arcane tricks. As an exercise in internalising such forbidden knowledge, however, it is useful to address reality unshielded by such swaddling conveniences.

In fabricating models for more practical reasons, however, we would naturally turn to these tools unless our unspeakable demands go beyond their natural scope. As a personal choice, brms is appealing due to its more natural construction of readable per-model Stan code to be compiled. This allows for the grotesque internals of generated models to be inspected and, if required, twisted to whatever form we desire. rstanarm, by contrast, avoids per-model compilation by pre-compiling more generically applicable models, but its underlying Stan code is correspondingly more arcane for an unskilled neophyte.

The Stan models presented in previous posts have also been constructed as simply as possible and have avoided all but the most universally accepted tricks for improving speed and stability⁷. Most notably, Stan presents specific functions for GLMs based on the Poisson and negative binomial distributions that apply standard link functions directly. As mentioned, we consider it more useful for personal and public indoctrination to use the basic, albeit log-form parameterisations.

Last Rites

In concluding the dark descent of this series of posts on Bayesian inference, generalised linear models, and the unearthly effects of extraterrestrial visitations on humanity, we have applied numerous esoteric techniques to identify, describe, and quantify the relationship between human population and UFO sightings. The enigmatic model constructed throughout this and the previous three entries darkly implies that, while the rate of inexplicable aerial phenomena is inextricably and positively linked to humanity’s unchecked growth, there are nonetheless unseen factors that draw our non-terrestrial visitors to certain populations more than others, and that their focus and attention is ever more acute.

This series has inevitably fallen short of a full and meaningful elucidation of the techniques of Bayesian inference and Stan. From this first step on such a path, then, interested students of the bizarre and arcane would be well advised to draw on the following esoteric resources:

Until then, watch the skies and archive your data.

Footnotes

Bayes vs. the Invaders! Part Three: The Parallax View

moth — Wed, 17 Apr 2019 13:35:56 +0000

The Parallax View

In the previous post of this series unveiling the relationship between UFO sightings and population, we crossed the threshold of normality underpinning linear models to construct a generalised linear model based on the more theoretically satisfying Poisson distribution.

On inspection, however, this model revealed itself to be less well suited to the data than we had, in our tragic ignorance, hoped. While it appeared, on visual inspection, to capture some features of the data, the predictive posterior density plot demonstrated that it still fell short of addressing the subtleties of the original.

In this post, we will seek to overcome this sad lack in two ways. Firstly, we will subject our models to pitiless mathematical scrutiny to assess their ability to describe the data. Secondly, with our eyes irrevocably opened to these techniques, we will construct an ever more complex armillary with which to approach the unknowable truth.

Critical Omissions of Information

Our previous post showed how the fit of the Poisson model to the data differed from that of the simple Gaussian linear model. When presented with a grim array of potential simulacra, however, it is crucial to have reliable and quantitative mechanisms to select amongst them.

The eldritch procedure most suited to this purpose, model selection, in our framework, draws on information criteria that express the relative effectiveness of models at creating sad mockeries of the original data. The original and most well-known such criterion is the Akaike Information Criterion, which has, in turn, spawned a multitude of successors applicable in different situations and with different properties. Here, we will make use of Leave-One-Out Cross Validation (LOO-CV)⁸ as the most applicable to the style of model and set of techniques applied here.

It is important to reiterate that these approaches do not speak to an absolute underlying truth; information criteria allow us to choose between models, assessing which has most closely assimilated the madness and chaos of the data. For LOO-CV, this results in an expected log predictive density (elpd) for each model. The model with the highest elpd is the least-warped mirror of reality amongst those we subject to scrutiny.

There are many fragile subtleties to model selection, of which we will mention only two here. Firstly, in general, the greater the number of predictors or variables incorporated into a model, the more closely it will be able to mimic the original data. This is problematic, in that a model can become overfit to the original data and thus be unable to represent previously unseen data accurately — it learns to mimic the form of the observed data at the expense of uncovering its underlying reality. The LOO-CV technique avoids this trap by, in effect, withholding data from the model to assess its ability to make accurate inferences on previously unseen data.

The second consideration in model selection is that the information criteria scores of models, such as (elpd) in LOO-CV, are subject to standard error in their assessment; the score itself is not a perfect metric of model performance, but a cunning approximation. As such we will only consider one model to have outperformed its competitors if the difference in their relative elpd is several times greater than this standard error.
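The withholding idea can be illustrated by brute force. The sketch below is not the procedure used by the loo package, which approximates leave-one-out from a single Bayesian fit; it is a simplified, non-Bayesian analogue on simulated data, refitting an ordinary `lm()` once per withheld observation and scoring the withheld point with a plug-in Gaussian log density. All data, column names, and the helper `loo_lpd()` are invented for illustration.

```r
set.seed(1890)

# Simulated data: y depends on x; z is an unrelated predictor.
n <- 50
dat <- data.frame(x = runif(n, 0, 10), z = rnorm(n))
dat$y <- 2 + 0.7 * dat$x + rnorm(n, sd = 1)

# Score each observation using a model that never saw it.
# (Assumes the response column is called y.)
loo_lpd <- function(formula, data) {
  sapply(seq_len(nrow(data)), function(i) {
    fit  <- lm(formula, data = data[-i, , drop = FALSE])
    pred <- predict(fit, newdata = data[i, , drop = FALSE])
    dnorm(data$y[i], mean = pred, sd = summary(fit)$sigma, log = TRUE)
  })
}

lpd_good <- loo_lpd(y ~ x, dat)   # model using the relevant predictor
lpd_poor <- loo_lpd(y ~ z, dat)   # model using the unrelated predictor

# Higher summed log predictive density means better out-of-sample prediction.
elpd_diff <- sum(lpd_good) - sum(lpd_poor)

# Standard error of the difference, from the pointwise differences.
se_diff <- sqrt(n) * sd(lpd_good - lpd_poor)

c(elpd_diff = elpd_diff, se_diff = se_diff)
```

Comparing the size of `elpd_diff` with `se_diff` is the same judgement that we apply to the Bayesian comparisons in this post.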

With this understanding in hand, we can now ruthlessly quantify the effectiveness of the Gaussian linear model against the Poisson generalised linear model.

Gaussian vs. the Poisson

The original model presented before our subsequent descent into horror was a simple linear Gaussian, produced through use of ggplot2's geom_smooth function. To compare this meaningfully against the Poisson model of the previous post, we must now recreate this model using the, now hideously familiar, tools of Bayesian modelling with Stan.

Show Gaussian model specification code.

population_model_normal.stan
[code language="c"]

data {

// Number of rows (observations)
int observations;

// Predictor (population of state)
vector[ observations ] population;

// Response (counts)
real counts[observations];

}

parameters {

// Intercept
real< lower=0 > a;

// Slope
real< lower=0 > b;

// Standard deviation
real< lower=0 > sigma;
}

model {

// Priors
a ~ normal( 0, 5 );
b ~ normal( 0, 5 );
sigma ~ cauchy( 0, 2.5 );

// Model
counts ~ normal( a + population * b, sigma );

}

generated quantities {

// Posterior predictions
vector[observations] counts_pred;

// Log likelihood (for LOO)
vector[observations] log_lik;

for (n in 1:observations) {

log_lik[n] = normal_lpdf( counts[n] | a + population[n]*b, sigma );
counts_pred[n] = normal_rng( a + population[n]*b, sigma );

}

}

[/code]

With both models straining in their different directions towards the light, we apply LOO-CV to assess their effectiveness at predicting the data.

Show LOO-CV comparison code.

LOO-CV R-code snippet. Full code is at the end of this post.
[code language="R"]

…
# Compare models with LOO
log_lik_normal <- extract_log_lik(fit_ufo_pop_normal, merge_chains = FALSE)
r_eff_normal <- relative_eff(exp(log_lik_normal))
loo_normal <- loo(log_lik_normal, r_eff = r_eff_normal, cores = 2)

log_lik_poisson <- extract_log_lik(fit_ufo_pop_poisson, merge_chains = FALSE)
r_eff_poisson <- relative_eff(exp(log_lik_poisson))
loo_poisson <- loo(log_lik_poisson, r_eff = r_eff_poisson, cores = 2)
…

[/code]

> compare( loo_normal, loo_poisson )
elpd_diff        se 
  -8576.1     712.5

The information criterion shows that the complexity of the Poisson model does not, in fact, produce a more effective model than the false serenity of the Gaussian⁹. The negative elpd_diff of the compare function supports the first of the two models, and the magnitude being over twelve times greater than the standard error leaves little doubt that the difference is significant. We must, it seems, look further.

With these techniques for selecting between models in hand, then, we can move on to constructing ever more complex attempts to dispel the darkness.

Trials without End

The Poisson distribution, whilst appropriate for many forms of count data, suffers from fundamental limits to its understanding. The single parameter of the Poisson, $\lambda$, enforces that the mean and variance of the data are equal. When such comforting falsehoods wither in the pale light of reality, we must move beyond the gentle chains in which the Poisson binds us.

The next horrific evolution, then, is the negative binomial distribution, which similarly speaks to count data, but presents a dispersion parameter ($\phi$) that allows the variance to exceed the mean¹⁰.
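The difference between the two distributions is easy to see by simulation. The sketch below assumes the mean and dispersion parameterisation used by Stan's `neg_binomial_2`, for which the variance is $\mu + \mu^2/\phi$; in R's `rnbinom()` the `size` argument plays the role of $\phi$. The values of `mu` and `phi` are arbitrary illustrations, not estimates from the UFO data.

```r
set.seed(1938)

# Theoretical variance of a negative binomial with mean mu and dispersion phi.
nb_variance <- function(mu, phi) mu + mu^2 / phi

mu  <- 10
phi <- 2   # illustrative value; smaller phi means greater overdispersion

pois_draws <- rpois(1e5, lambda = mu)
nb_draws   <- rnbinom(1e5, mu = mu, size = phi)

# Poisson: sample variance close to the mean.
c(mean = mean(pois_draws), var = var(pois_draws))

# Negative binomial: sample variance close to mu + mu^2 / phi, far above the mean.
c(mean = mean(nb_draws), var = var(nb_draws), theory = nb_variance(mu, phi))
```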

With our arcane theoretical library suitably expanded, we can now replace the still-beating Poisson heart of our earlier generalised linear model with the more complex machinery of the negative binomial.

As with the Poisson, our negative binomial generalised linear model employs a log link function to transform the linear predictor. The Stan code for this model is given below.

Show negative binomial model specification code.

population_model_negbinomial.stan
[code language="c"]

data {

// Number of rows (observations)
int observations;

// Predictor (population of state)
vector[ observations ] population_raw;

// Response (counts)
int counts[observations];

}

transformed data {

// Center and scale the predictor
vector[ observations ] population;
population = ( population_raw - mean( population_raw ) ) / sd( population_raw );

}

parameters {

// Negative binomial dispersion parameter (must be positive)
real< lower=0 > phi;

// Intercept
real a;

// Slope
real b;

}

transformed parameters {

vector[observations] mu;
mu = a + b*population;

}

model {

// Priors
a ~ normal( 0, 1 );
b ~ normal( 0, 1 );
phi ~ cauchy( 0, 5 );

// Model
// Uses the log version of the neg_binomial_2 to avoid
// manual exponentiation of the linear predictor.
// (This avoids numerical problems in the calculations.)
counts ~ neg_binomial_2_log( mu, phi );

}

generated quantities {

vector[observations] counts_pred;
vector[observations] log_lik;

for (n in 1:observations) {

log_lik[n] = neg_binomial_2_log_lpmf( counts[n] | mu[n], phi );
counts_pred[n] = neg_binomial_2_log_rng( mu[n], phi );

}

}

[/code]

With this model fit, we can compare its whispered falsehoods against both the original linear Gaussian model and the Poisson GLM:

Show LOO-CV comparison code.

Code snippet for calculating the LOO-CV elpd for three models. The full R code for building and comparing all models is listed at the end of this post.

[code language="R"]

…
# Compare models with LOO
log_lik_normal <- extract_log_lik(fit_ufo_pop_normal, merge_chains = FALSE)
r_eff_normal <- relative_eff(exp(log_lik_normal))
loo_normal <- loo(log_lik_normal, r_eff = r_eff_normal, cores = 2)

log_lik_poisson <- extract_log_lik(fit_ufo_pop_poisson, merge_chains = FALSE)
r_eff_poisson <- relative_eff(exp(log_lik_poisson))
loo_poisson <- loo(log_lik_poisson, r_eff = r_eff_poisson, cores = 2)

log_lik_negbinom <- extract_log_lik(fit_ufo_pop_negbinom, merge_chains = FALSE)
r_eff_negbinom <- relative_eff(exp(log_lik_negbinom))
loo_negbinom <- loo(log_lik_negbinom, r_eff = r_eff_negbinom, cores = 2)
…
[/code]

> compare( loo_poisson, loo_negbinom )
elpd_diff        se 
   8880.8     721.9

With the first comparison, it is clear that the sinuous flexibility offered by the dispersion parameter, $\phi$, of the negative binomial allows that model to mould itself much more effectively to the data than the Poisson. The elpd_diff score is positive, indicating that the second of the two compared models is favoured; the difference is over twelve times the standard error, giving us confidence that the negative binomial model is meaningfully more effective than the Poisson.

Whilst superior to the Poisson, does this adaptive capacity allow the negative binomial model to render the naïve Gaussian linear model obsolete?

> compare( loo_normal, loo_negbinom )
elpd_diff        se 
    304.7      30.9

The negative binomial model subsumes the Gaussian with little effort. The elpd_diff is almost ten times the standard error in favour of the negative binomial GLM, giving us confidence in choosing it. From here on, we will rely on the negative binomial as the core of our schemes.

Overlapping Realities

The improvements we have seen with the negative binomial model allow us to discard the Gaussian and Poisson models with confidence. They are not, however, sufficient to fill the gaping void induced by our belief that the sightings of abnormal aerial phenomena in differing US states vary differently with their human population.

To address this question we must ascertain whether allowing our models to unpick the individual influence of states will improve their predictive ability. This, in turn, will lead us into the gnostic insanity of hierarchical models, in which we group predictors in our models to account for their shadowy underlying structures.

Limpid Pools

The first step on this path is to allow part of the linear function underpinning our model, specifically the intercept value, $\alpha$, to vary between different US states. In a simple linear model, this causes the line of best fit for each state to meet the y-axis at a different point, whilst maintaining a constant slope for all states. In such a model, the result is a set of parallel lines of fit, rather than a single global truth.

This varying intercept can describe a range of possible phenomena for which the rate of change remains constant, but the baseline value varies. In such hierarchical models we employ a concept known as partial pooling to extract as much forbidden knowledge from the reluctant data as possible.

A set of entirely separate models, such as the per-state set of linear regressions presented in the first post of this series, employs a no pooling approach: the data of each state is treated separately, with an entirely different model fit to each. This certainly considers the uniqueness of each state, but cannot benefit from insights drawn from the broader range of data we have available, which we may reasonably assume to have some relevance.

By contrast, the global Gaussian, Poisson, and negative binomial models presented so far represent complete pooling, in which the entire set of data is considered a formless, protean amalgam without meaningful structure. This mindless, groping approach causes the unique features of each state to be lost amongst the anarchy and chaos.

A partial pooling approach instead builds a global mean intercept value across the dataset, but allows the intercept value for each individual state to deviate according to a governing probability distribution. This accounts for the individuality of each group of observations, in our case the state, whilst also drawing on the accumulated wisdom of the whole.
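As a rough illustration of what partial pooling does, the sketch below uses the standard precision-weighted approximation for a simple Gaussian multilevel model, not the negative binomial model developed in this post. The function name `partial_pool` and every value passed to it are hypothetical, chosen only to show how sparsely-observed groups are drawn towards the whole.

```r
# Approximate partially-pooled estimate of a group mean in a Gaussian
# multilevel model: a precision-weighted average of the group's own mean
# (no pooling) and the grand mean (complete pooling).
# All names and values here are illustrative assumptions.
partial_pool <- function(group_mean, n, grand_mean, sigma_y, sigma_alpha) {
  w_group  <- n / sigma_y^2       # precision of the group's own mean
  w_global <- 1 / sigma_alpha^2   # precision of the between-group distribution
  (w_group * group_mean + w_global * grand_mean) / (w_group + w_global)
}

# The same raw group mean (10) supported by 1, 10, and 1000 observations,
# against a grand mean of 4.
pooled <- partial_pool(group_mean = 10, n = c(1, 10, 1000),
                       grand_mean = 4, sigma_y = 2, sigma_alpha = 1)
print(pooled)
```

With a single observation the estimate sits close to the grand mean; with a thousand it is almost indistinguishable from the group's own mean. The hierarchical models in this post leave the sampler to learn that balance from the data rather than fixing it by formula.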

We now construct a partially-pooled varying intercept model, in which the parameters and observations for each US state in our dataset are individually indexed:

$$\begin{eqnarray}
y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &=& \alpha_i + \beta x\\
\alpha_i &\sim& \mathcal{N}(\mu_\alpha, \sigma_\alpha)\\
\beta &\sim& \mathcal{N}(0, 1)\\
\phi &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$

Note that the intercept parameter, $\alpha$, in the second line is now indexed by the state, represented here by the subscript $i$. The slope parameter, $\beta$, remains constant across all states.
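To make the generative story concrete, the following sketch simulates data from a model of this form in base R. Every value here (five states, twenty-five years, the hyperparameters, the seed) is an invented assumption rather than anything estimated from the sightings data. R's `rnbinom()` with `mu` and `size` uses the same mean and dispersion parameterisation as the $\mu$ and $\phi$ above.

```r
# Simulate counts from a varying intercept negative binomial model.
# All values are hypothetical and for illustration only.
set.seed(1947)

n_states <- 5
n_years  <- 25

mu_alpha    <- 1     # mean of the per-state intercepts
sigma_alpha <- 0.5   # SD of the per-state intercepts
beta        <- 0.3   # slope shared by all states
phi         <- 2     # negative binomial dispersion

# One intercept per state, drawn from the governing distribution
alpha <- rnorm(n_states, mean = mu_alpha, sd = sigma_alpha)

# State index for each observation, and a centred and scaled predictor
state <- rep(seq_len(n_states), each = n_years)
x <- as.numeric(scale(runif(n_states * n_years, min = 500, max = 20000)))

# Linear predictor on the log scale, then counts
eta <- alpha[state] + beta * x
y <- rnbinom(length(eta), mu = exp(eta), size = phi)

# Mean simulated count per state
print(tapply(y, state, mean))
```

Fitting the model reverses this process: given only `y`, `x`, and `state`, the sampler attempts to recover the intercepts, the slope, and the dispersion.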

This model can be rendered in Stan code as follows:

population_model_negbinomial_var_intercept.stan
[code language="c"]

data {

// Number of rows (observations)
int observations;

// Number of states
int< lower=0 > states;

// Vector detailing the US state in which each observation (count of
// sightings in a year) occurred
int< lower=1, upper=states > state[ observations ];

// Predictor (population of state)
vector[ observations ] population_raw;

// Response (counts)
int counts[ observations ];

}

transformed data {

// Center and scale the predictor
vector[ observations ] population;
population = ( population_raw - mean( population_raw ) ) / sd( population_raw );

}

parameters {

// Per-state intercepts
vector[ states ] a;

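// Mean and SD of distribution from which per-state intercepts are drawn.
// (Note: the lower bound on mu_a restricts the mean intercept to be
// non-negative, truncating its normal(0, 1) prior at zero.)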
real< lower=0 > mu_a;
real< lower=0 > sigma_a;

// Negative binomial dispersion parameter
real< lower=0 > phi;

// Slope
real b;

}

transformed parameters {

// Calculate location parameter for negative binomial incorporating
// per-state indicator.
vector[ observations ] eta;

for( i in 1:observations ) {
eta[i] = a[ state[i] ] + population[i] * b;
}
}

model {

mu_a ~ normal(0, 1);
sigma_a ~ cauchy(0, 2);

// Priors
a ~ normal ( mu_a, sigma_a );
b ~ normal( 0, 1 );
phi ~ cauchy( 0, 2 );

// Model
counts ~ neg_binomial_2_log( eta, phi );

}

generated quantities {

vector[observations] counts_pred;
vector[observations] log_lik;

vector[observations] mu;
mu = exp( eta );

for (n in 1:observations) {

log_lik[n] = neg_binomial_2_log_lpmf( counts[n] | eta[n], phi );
counts_pred[n] = neg_binomial_2_log_rng( eta[n], phi );

}

}

[/code]

Once the model has twisted itself into the most appropriate form for our data, we can now compare it against our previous completely-pooled model:

Code snippet for comparing models via LOO-CV. Full code at the end of this post.

[code] …
# Compare models with LOO
log_lik_negbinom_var_intercept <- extract_log_lik(fit_ufo_pop_negbinom_var_intercept, merge_chains = FALSE)
r_eff_negbinom_var_intercept <- relative_eff(exp(log_lik_negbinom_var_intercept))
loo_negbinom_var_intercept <- loo(log_lik_negbinom_var_intercept, r_eff = r_eff_negbinom_var_intercept, cores = 2)
…
[/code]

> compare( loo_negbinom, loo_negbinom_var_intercept )
elpd_diff        se 
    363.2      28.8

Our transcendent journey from the statistical primordial ooze continues: the varying intercept model is favoured over the completely-pooled model, with an elpd_diff of 363.2 that is more than twelve times its standard error.

Sacred Geometry

Now that our minds have apprehended a startling glimpse of the implications of the varying intercept model, it is natural to consider taking a further terrible step and allowing both the slope and the intercept to vary¹¹.

With both the intercept and slope of the underlying linear predictor varying, an additional complexity raises its head: can we safely assume that these parameters, the intercept and slope, vary independently of each other, or may there be arcane correlations between them? Do states with a higher intercept also experience a higher slope in general, or is the opposite the case? Without prior knowledge to the contrary, we must allow our model to determine these possible correlations, or we are needlessly throwing away potential information in our model.

For a varying slope and intercept model, therefore, we must now include a correlation matrix, $\Omega$, between the parameters of the linear predictor for each state in our model. This correlation matrix, as with all parameters in a Bayesian framework, must be expressed with a prior distribution from which the model can begin its evaluation of the data.

With deference to the authoritative quaint and curious volume of forgotten lore, we will use an LKJ prior for the correlation matrix without further discussion of the reasoning behind it.

$$\begin{eqnarray}
y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &=& \alpha_i + \beta_i x\\
\begin{bmatrix}
\alpha_i\\
\beta_i
\end{bmatrix} &\sim& \mathcal{N}(
\begin{bmatrix}
\mu_\alpha\\
\mu_\beta
\end{bmatrix}, \Omega )\\
\Omega &\sim& \mathbf{LKJCorr}(2)\\
\phi &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$
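The Stan code for this model passes `quad_form_diag( omega, state_sigma )` to the multivariate normal rather than the bare correlation matrix: the correlations are scaled by a standard deviation for each parameter to give a covariance matrix. A minimal base R sketch of that construction follows, with made-up values standing in for the correlation and the standard deviations.

```r
# Build a covariance matrix from a correlation matrix and a vector of
# standard deviations: diag(sigma) %*% omega %*% diag(sigma).
# The numbers below are hypothetical, not posterior estimates.
omega <- matrix(c(1.0, 0.3,
                  0.3, 1.0), nrow = 2, byrow = TRUE)
state_sigma <- c(0.5, 2.0)   # SDs of the intercepts and of the slopes

covariance <- diag(state_sigma) %*% omega %*% diag(state_sigma)
print(covariance)

# Converting back to a correlation matrix recovers omega
print(cov2cor(covariance))
```

The diagonal of `covariance` holds the squared standard deviations, and the off-diagonal entries are the correlation multiplied by both standard deviations. Separating the two lets the LKJ prior govern only the correlation, whilst the scale of each parameter receives its own prior.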

This model has grown and gained a somewhat twisted complexity compared with the serene austerity of our earliest linear model. Despite this, each further step in the descent has followed its own perverse logic, and the progression should be clear. The corresponding Stan code follows:

population_model_negbinomial_var_intercept_slope.stan
[code]

data {

// Number of rows (observations)
int observations;

// Number of states
int< lower=0 > states;

// Vector detailing the US state in which each observation (count of
// sightings in a year) occurred
int< lower=1, upper=states > state[ observations ];

// Predictor (population of state)
vector[ observations ] population_raw;

// Response (counts)
int counts[ observations ];

}

transformed data {

// Center and scale the predictor
vector[ observations ] population;
population = ( population_raw - mean( population_raw ) ) / sd( population_raw );

}

parameters {

// Per-state intercepts and slopes
vector[ states ] state_intercept;
vector[ states ] state_slope;

// Baseline intercept and slope from which each group deviates.
real pop_intercept;
real pop_slope;

// Standard deviations of the per-state intercepts and slopes
vector< lower=0 >[2] state_sigma;

// Negative binomial dispersion parameter
real< lower=0 > phi;

// Parameter correlation matrix
corr_matrix[2] omega;

}

transformed parameters {

vector[2] vec_intercept_slope[ states ];
vector[2] mu_intercept_slope;

// Location parameter
vector[observations] eta;

// Per-state intercepts and slopes
for( i in 1:states ) {

vec_intercept_slope[ i, 1] = state_intercept[i];
vec_intercept_slope[ i, 2] = state_slope[i];

}

// Population slope and intercept
mu_intercept_slope[1] = pop_intercept;
mu_intercept_slope[2] = pop_slope;

// Calculate negative binomial location parameter
for( i in 1:observations ) {
eta[i] = state_intercept[ state[i] ] + state_slope[ state[i]] * population[ i ];
}

}

model {

// Priors
omega ~ lkj_corr(2);
phi ~ cauchy(0, 3 );
state_sigma ~ cauchy( 0, 3 );

pop_intercept ~ normal( 0, 1 );
pop_slope ~ normal( 0, 1 );

vec_intercept_slope ~ multi_normal( mu_intercept_slope, quad_form_diag( omega, state_sigma ) );

// Model
counts ~ neg_binomial_2_log( eta, phi );

}

generated quantities {

vector[observations] counts_pred;
vector[observations] log_lik;

for (n in 1:observations) {

log_lik[n] = neg_binomial_2_log_lpmf( counts[n] | eta[n], phi );
counts_pred[n] = neg_binomial_2_log_rng( eta[n], phi );

}

}

[/code]

The ultimate test of our faith, then, is whether the added complexity of the partially-pooled varying slope, varying intercept model is justified. Once again, we turn to the ruthless judgement of the LOO-CV:

Code snippet for calculating the LOO-CV elpd. The full R code for building and comparing all models in this post is listed at the end.

[code language="R"] …
log_lik_negbinom_var_intercept_slope <- extract_log_lik(fit_ufo_pop_negbinom_var_intercept_slope, merge_chains = FALSE)
r_eff_negbinom_var_intercept_slope <- relative_eff(exp(log_lik_negbinom_var_intercept_slope))
loo_negbinom_var_intercept_slope <- loo(log_lik_negbinom_var_intercept_slope, r_eff = r_eff_negbinom_var_intercept_slope, cores = 2)
…
[/code]

> compare( loo_negbinom_var_intercept, loo_negbinom_var_intercept_slope )
elpd_diff        se 
     13.3       2.4

In this final step we can see that our labours in the arcane have been rewarded. The improvement is more modest than in earlier comparisons, but the elpd_diff of 13.3 is still more than five times its standard error in favour of the varying slope, varying intercept model. Whilst the potential for deeper and more perfect models never ends, we will settle for now on this.
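For reference, the four comparisons reported in this post can be set side by side, with each elpd_diff expressed as a multiple of its standard error. The sketch below only re-uses the figures printed above; the data frame and its column names are our own invention, and the ratio is an informal yardstick rather than a formal test.

```r
# elpd_diff and standard error for each comparison reported in this post,
# as printed by compare(). Labels and column names are illustrative.
comparisons <- data.frame(
  comparison = c("Poisson vs. negative binomial",
                 "Gaussian vs. negative binomial",
                 "Complete pooling vs. varying intercept",
                 "Varying intercept vs. varying intercept and slope"),
  elpd_diff  = c(8880.8, 304.7, 363.2, 13.3),
  se         = c(721.9, 30.9, 28.8, 2.4)
)

# How many standard errors separate the two models in each comparison?
comparisons$ratio <- comparisons$elpd_diff / comparisons$se
print(comparisons)
```

Each refinement is favoured by at least five standard errors, although the final step, from varying intercept to varying intercept and slope, is by far the smallest.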

Mortal Consequences

With our final model built, we can now begin to examine its mortifying implications. We will leave the majority of the subjective analysis for the next, and final, post in this series. For now, however, we can reinforce our quantitative analysis with visual assessment of the posterior predictive distribution output of our final model.

Posterior predictive density plot of varying intercept, varying slope negative binomial GLM of UFO sightings. (PDF Version)

bayes_plots.r
(Includes code to generate traceplot and posterior predictive distribution plot.)
[code language="R"]

library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )

# Load UFO data
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
font_add( "bold_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Plots, posterior predictive checking, LOO.

# (Visualisations only produced for varying slope/intercept model, as a result
# of LOO checking.)

# Bayesplot needs to be told which theme to use as a default.
theme_set( theme_weird() )

# Read the fitted model
fit_ufo_pop_negbinom_var_intercept_slope <-
readRDS( "work/fit_ufo_pop_negbinom_var_intercept_slope.rds" )

## Model checking visualisations

# Extract posterior estimates from the fit (from the generated quantities of the stan model)
counts_pred_negbinom_var_intercept_slope <- as.matrix( fit_ufo_pop_negbinom_var_intercept_slope, pars = "counts_pred" )

# First, as always, a traceplot
tp <-
traceplot(
fit_ufo_pop_negbinom_var_intercept_slope,
pars = c("pop_intercept", "pop_slope", "phi" ),
ncol=1 ) +
scale_colour_viridis_d( name="Chain", direction=-1 ) +
theme_weird()

title <-
ggdraw() +
draw_label("Traceplot of Key Parameters", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

titled_tp <-
plot_grid(title, tp, ncol=1, rel_heights=c(0.1, 1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222")
)

save_plot("output/traceplot.pdf",
titled_tp,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

# Posterior predictive density. (Visual representation of goodness of fit.)
gp_ppc <-
ppc_dens_overlay(
y = extract2( ufo_population_sightings, "count" ),
yrep = counts_pred_negbinom_var_intercept_slope ) +
theme_weird()

title <-
ggdraw() +
draw_label("Posterior Predictive Density Plot", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

titled_gp_ppc <-
plot_grid(title, gp_ppc, ncol=1, rel_heights=c(0.1, 1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222")
)

save_plot("output/posterior_predictive.pdf",
titled_gp_ppc,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

In comparison with earlier attempts, the varying intercept and slope model visibly captures the overall shape of the distribution with terrifying ease. As our wary confidence mounts in the mindless automaton we have fashioned, we can now examine its predictive ability on our original data.

Varying intercept and slope negative binomial GLM of UFO sightings against population. (PDF Version)

population_plot.r
[code]

library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )
library( modelr )

# Load UFO data and model
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

fit_ufo_pop_negbinom_var_intercept_slope <-
readRDS("work/fit_ufo_pop_negbinom_var_intercept_slope.rds")
#readRDS("work/fit_ufo_pop_normal_var_intercept_slope.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
font_add( "bold_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Plots, posterior predictive checking, LOO
theme_set( theme_weird() )

## Model checking visualisations

## Create per-state predictive fit plots

# Convert fitted model (stanfit) object to a tibble
fit_tbl <-
summary(fit_ufo_pop_negbinom_var_intercept_slope)$summary %>%
as.data.frame() %>%
mutate(variable = rownames(.)) %>%
select(variable, everything()) %>%
as_tibble()

counts_predicted <-
fit_tbl %>%
filter( str_detect(variable, 'counts_pred') )

ufo_population_sightings_pred <-
ufo_population_sightings %>%
ungroup() %>%
mutate( count_mean = counts_predicted$mean,
lower = counts_predicted$`25%`,
upper = counts_predicted$`75%`)

# (Using mean and quartiles of the fit summary, giving a 50% interval.)
predictive_plot <-
ggplot( ufo_population_sightings_pred ) +
geom_point( aes( x=population, y=count, colour=state ), size=0.6, alpha=0.8 ) +
geom_line(aes( x=population, y=count_mean, colour=state )) +
geom_ribbon(aes(x=population, ymin = lower, ymax = upper, fill=state), alpha = 0.25) +
labs( x="Population (Thousands)", y="Annual Sightings" ) +
scale_fill_viridis_d( name="State" ) +
scale_colour_viridis_d( name="State" ) +
theme(
axis.title.y = element_text( angle=90 ),
legend.position = "none" )

# Construct full plot, with title and backdrop.
title <-
ggdraw() +
draw_label("UFO Sightings against State Population (1990-2014)", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("Negative Binomial Hierarchical GLM. Varying slope and intercept. 50% credible intervals.", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.48) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.16)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org | Tool: http://www.mc-stan.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

predictive_plot_titled <-
plot_grid(title, predictive_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222")
)

save_plot("output/predictive_plot.pdf",
predictive_plot_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

The purpose of our endeavours is to show whether or not the frequency of extraterrestrial visitations is merely a sad reflection of the number of unsuspecting humans living in each state. After seemingly endless cryptic calculations, our statistical machinery implies that there are deeper mysteries here: allowing the relationship between sightings and the underlying linear predictors to vary by state more perfectly predicts the data. There are clearly other, hidden, factors in play.

More than that, however, our final model allows us to quantify these differences. We can now retrieve from the very bowels of our inferential process the per-state distribution of parameters for both the slope and intercept of the linear predictor.

Varying slope and intercept negative binomial GLM parameter plot for UFO sightings model. (PDF Version)

slope_intercept_plot.r
[code language="R"]

library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )
library( modelr )

# Load UFO data and model
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

fit_ufo_pop_negbinom_var_intercept_slope <-
readRDS("work/fit_ufo_pop_negbinom_var_intercept_slope.rds")
#readRDS("work/fit_ufo_pop_normal_var_intercept_slope.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
font_add( "bold_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Plots, posterior predictive checking, LOO
theme_set( theme_weird() )

# Use teal colour scheme
color_scheme_set( "teal")

## Model checking visualisations

# US state data
us_state_factors <-
levels( factor( ufo_population_sightings$state ) )

# US state names for nice plotting
# Data:
state_code_data <-
read_csv( file="data/us_states.csv" ) %>%
filter( code %in% us_state_factors )

# Rename variables back to state names
posterior_intercepts <-
as.data.frame( fit_ufo_pop_negbinom_var_intercept_slope ) %>%
as_tibble %>%
select(starts_with('state_intercept') ) %>%
rename_all( ~us_state_factors ) %>%
rename_all( ~extract2( state_code_data, "us_state" ) )

# Rename variables back to state names
posterior_slopes <-
as.data.frame( fit_ufo_pop_negbinom_var_intercept_slope ) %>%
as_tibble %>%
select(starts_with('state_slope') ) %>%
rename_all( ~us_state_factors ) %>%
rename_all( ~extract2( state_code_data, "us_state" ) )

# Posterior draws combined
posterior_slopes_long <-
posterior_slopes %>%
gather( value = "slope" )

posterior_intercepts_long <-
posterior_intercepts %>%
gather( value = "intercept" )

posterior_draws_long <-
bind_cols( posterior_intercepts_long, posterior_slopes_long ) %>%
select( -key1 ) %>%
transmute( state = key, intercept, slope )

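# Interval plots (intercepts and slopes)
# Plot intercept parameters for varying intercept and slope model
# (Point estimate and outer interval are set explicitly to match the plot
# subtitle; the bayesplot defaults are the median and a 90% outer interval.)
gp_intercept <-
mcmc_intervals( posterior_intercepts, prob = 0.5, prob_outer = 0.95, point_est = "mean" ) +
ggtitle( "Intercepts" ) +
theme_weird()

# Plot slope parameters for varying intercept and slope model
# (Remove y-axis labels as this will be aligned with the intercept plot.)
gp_slope <-
mcmc_intervals( posterior_slopes, prob = 0.5, prob_outer = 0.95, point_est = "mean" ) +
ggtitle( "Slopes" ) +
theme_weird() +
theme(
axis.text.y = element_blank()
)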

gp_slope_intercept <-
plot_grid( gp_intercept, gp_slope, ncol=2 )

# Construct full plot, with title and backdrop.
title <-
ggdraw() +
draw_label("Per-State UFO Intercepts and Slopes", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("Mean value, 50% credible interval, and 95% credible interval shown.", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.48) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.16)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org | Tool: http://www.mc-stan.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

gp_slope_intercept_titled <-
plot_grid(title, gp_slope_intercept, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme( panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"))

save_plot("output/ufo_per-state_intercepts-slopes.pdf",
gp_slope_intercept_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

It is important to note that, while we are still referring to the $\alpha$ and $\beta$ parameters as the intercept and slope, their interpretation is more complex in a generalised linear model with a $\log$ link function than in the simple linear model. For now, however, this diagram is sufficient to show that the horror visited on innocent lives by our interstellar visitors is not purely arbitrary, but depends at least in part on geographical location.
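A small numerical sketch may help with the interpretation of the log link mentioned above. Because the linear predictor is exponentiated, a one-unit increase in the (standardised) predictor multiplies the expected count by $e^{\beta}$ rather than adding $\beta$ to it, and $e^{\alpha}$ is the expected count when the predictor sits at its mean. The values of `alpha` and `beta` below are hypothetical, not estimates from the fitted model.

```r
# Interpreting intercept and slope under a log link.
# alpha and beta are invented values for illustration.
alpha <- 1.5
beta  <- 0.4

# Standardised predictor: 0 is the mean, each unit is one standard deviation
x <- c(-1, 0, 1, 2)

# Expected counts on the natural scale
mu <- exp(alpha + beta * x)
print(mu)

# The ratio between successive expected counts is constant...
ratios <- mu[-1] / mu[-length(mu)]
print(ratios)

# ...and equal to exp(beta)
print(exp(beta))
```

A state with a larger slope therefore sees its expected sightings multiplied by a larger factor for each standard deviation of additional population, which is a quite different statement from gaining a fixed number of extra sightings.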

With this malign inferential process finally complete we will turn, in the next post, to a trembling interpretation of the model and its dark implications for our collective future.

Model Fitting and Comparison Code Listing

This code fits the range of models developed in this series, relying on the individual Stan source code files, and runs the LOO-CV comparisons discussed in this post.

population_model.r
[code]
library( tidyverse )
library( magrittr )

library( ggplot2 )
library( showtext )

library( rstan )
library( tidybayes )
library( loo )

# Load UFO data
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

## Simple Models
## Complete Pooling — all states considered identical.

# Fit model of UFO sightings (Normal)
if( not( file.exists( "work/fit_ufo_pop_normal.rds" ) ) ) {

message("Fitting basic Normal model.")

fit_ufo_pop_normal <-
stan( file="model/population_model_normal.stan",
data=list(
observations = nrow( ufo_population_sightings ),
population = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" )
)
)

saveRDS( fit_ufo_pop_normal, "work/fit_ufo_pop_normal.rds" )

message("Basic Normal model fitted.")

} else {

fit_ufo_pop_normal <- readRDS( "work/fit_ufo_pop_normal.rds" )

}

# Fit model of UFO sightings (Poisson)
if( not( file.exists( "work/fit_ufo_pop_poisson.rds" ) ) ) {

message("Fitting basic Poisson model.")

fit_ufo_pop_poisson <-
stan( file="model/population_model_poisson.stan",
data=list(
observations = nrow( ufo_population_sightings ),
population_raw = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" )
)
)

saveRDS( fit_ufo_pop_poisson, "work/fit_ufo_pop_poisson.rds" )

message("Basic Poisson model fitted.")

} else {

fit_ufo_pop_poisson <- readRDS( "work/fit_ufo_pop_poisson.rds" )

}

# Fit model of UFO sightings (Negative Binomial)
if( not( file.exists( "work/fit_ufo_pop_negbinom.rds" ) ) ) {

message("Fitting basic negative binomial model.")

fit_ufo_pop_negbinom <-
stan( file="model/population_model_negbinomial.stan",
data=list(
observations = nrow( ufo_population_sightings ),
population_raw = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" ) )
)

saveRDS( fit_ufo_pop_negbinom, "work/fit_ufo_pop_negbinom.rds" )

message("Basic negative binomial model fitted.")

} else {

fit_ufo_pop_negbinom <- readRDS( "work/fit_ufo_pop_negbinom.rds" )

}

## Multilevel Models
## Partial Pooling (Varying Intercept)

if( not( file.exists( "work/fit_ufo_pop_negbinom_var_intercept.rds" ) ) ) {

message("Fitting varying intercept negative binomial model.")

fit_ufo_pop_negbinom_var_intercept <-
stan( file="model/population_model_negbinomial_var_intercept.stan",
data=list(
observations = nrow( ufo_population_sightings ),
population_raw = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" ),
states = length( unique( ufo_population_sightings$state ) ),
state = as.numeric( factor( ufo_population_sightings$state ) )
),
chains=4, iter=2000,
control = list(max_treedepth = 15, adapt_delta=0.9)
)

saveRDS( fit_ufo_pop_negbinom_var_intercept, "work/fit_ufo_pop_negbinom_var_intercept.rds" )

message("Varying intercept negative binomial model fitted.")

} else {

fit_ufo_pop_negbinom_var_intercept <- readRDS( "work/fit_ufo_pop_negbinom_var_intercept.rds" )

}

### Partial Pooling (Varying intercept and slope.)

if( not( file.exists( "work/fit_ufo_pop_negbinom_var_intercept_slope.rds" ) ) ) {

message("Fitting varying intercept and slope negative binomial model")

fit_ufo_pop_negbinom_var_intercept_slope <-
stan( file="model/population_model_negbinomial_var_intercept_slope.stan",
data=list(
observations = nrow( ufo_population_sightings ),
population_raw = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" ),
states = length( unique( ufo_population_sightings$state ) ),
state = as.numeric( factor( ufo_population_sightings$state ) )
),
chains=4, iter=2000,
control = list(max_treedepth = 12, adapt_delta=0.8)
)

saveRDS( fit_ufo_pop_negbinom_var_intercept_slope, "work/fit_ufo_pop_negbinom_var_intercept_slope.rds" )

message("Varying intercept and slope negative binomial model fitted.")

} else {

fit_ufo_pop_negbinom_var_intercept_slope <- readRDS( "work/fit_ufo_pop_negbinom_var_intercept_slope.rds" )

}

# Hierarchical normal. (Linear regression)
if( not( file.exists( "work/fit_ufo_pop_normal_var_intercept_slope.rds" ) ) ) {

message("Fitting varying intercept and slope normal model")

fit_ufo_pop_normal_var_intercept_slope <-
stan( file="model/population_model_normal_var_intercept_slope.stan",
data=list(
observations = nrow( ufo_population_sightings ),
population_raw = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" ),
states = length( unique( ufo_population_sightings$state ) ),
state = as.numeric( factor( ufo_population_sightings$state ) )
),
chains=4, iter=2000,
control = list(max_treedepth = 15, adapt_delta=0.9)
)

saveRDS( fit_ufo_pop_normal_var_intercept_slope, "work/fit_ufo_pop_normal_var_intercept_slope.rds" )

message("Varying intercept and slope normal model fitted.")

} else {

fit_ufo_pop_normal_var_intercept_slope <- readRDS( "work/fit_ufo_pop_normal_var_intercept_slope.rds" )

}

## Notify by text
message("All models fit.")

# Compare models with LOO
log_lik_normal <- extract_log_lik(fit_ufo_pop_normal, merge_chains = FALSE)
r_eff_normal <- relative_eff(exp(log_lik_normal))
loo_normal <- loo(log_lik_normal, r_eff = r_eff_normal, cores = 2)

log_lik_poisson <- extract_log_lik(fit_ufo_pop_poisson, merge_chains = FALSE)
r_eff_poisson <- relative_eff(exp(log_lik_poisson))
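loo_poisson <- loo(log_lik_poisson, r_eff = r_eff_poisson, cores = 2)

# (loo_negbinom is required by the comparisons below.)
log_lik_negbinom <- extract_log_lik(fit_ufo_pop_negbinom, merge_chains = FALSE)
r_eff_negbinom <- relative_eff(exp(log_lik_negbinom))
loo_negbinom <- loo(log_lik_negbinom, r_eff = r_eff_negbinom, cores = 2)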

log_lik_negbinom_var_intercept <- extract_log_lik(fit_ufo_pop_negbinom_var_intercept, merge_chains = FALSE)
r_eff_negbinom_var_intercept <- relative_eff(exp(log_lik_negbinom_var_intercept))
loo_negbinom_var_intercept <- loo(log_lik_negbinom_var_intercept, r_eff = r_eff_negbinom_var_intercept, save_psis = TRUE)

log_lik_negbinom_var_intercept_slope <- extract_log_lik(fit_ufo_pop_negbinom_var_intercept_slope, merge_chains = FALSE)
r_eff_negbinom_var_intercept_slope <- relative_eff(exp(log_lik_negbinom_var_intercept_slope))
loo_negbinom_var_intercept_slope <- loo(log_lik_negbinom_var_intercept_slope, r_eff = r_eff_negbinom_var_intercept_slope, save_psis = TRUE)

normal_poisson_comparison <- compare( loo_normal, loo_poisson )
poiss_negbinom_comparison <- compare( loo_poisson, loo_negbinom )
negbinom_negbinom_var_intercept_comparison <- compare( loo_negbinom, loo_negbinom_var_intercept )
negbinom_var_intercept_negbinom_var_intercept_slope_comparison <- compare( loo_negbinom_var_intercept, loo_negbinom_var_intercept_slope )

saveRDS( normal_poisson_comparison, "work/normal_poisson_comparison.rds" )
saveRDS( poiss_negbinom_comparison, "work/poiss_negbinom_comparison.rds" )
saveRDS( negbinom_negbinom_var_intercept_comparison, "work/negbinom_negbinom_var_intercept_comparison.rds" )
saveRDS( negbinom_var_intercept_negbinom_var_intercept_slope_comparison, "work/negbinom_var_intercept_negbinom_var_intercept_slope_comparison.rds" )

[/code]

Footnotes

Bayes vs. the Invaders! Part Two: Abnormal Distributions

moth — Mon, 08 Apr 2019 14:48:58 +0000

Crossing the Line

This post continues our series on developing statistical models to explore the arcane relationship between UFO sightings and population. The previous post is available here: Bayes vs. the Invaders! Part One: The 37th Parallel.

The simple linear model developed in the previous post is far from satisfying. It makes many unsupportable assumptions about the data and the form of the residual errors from the model. Most obviously, it relies on an underlying Gaussian (or normal) distribution for its understanding of the data. For our count data, some basic features of the Gaussian are inappropriate.

Most notably:

a Gaussian distribution is continuous whilst counts are discrete — you can’t have 2.3 UFO sightings in a given day;
the Gaussian can produce negative values, which are impossible when dealing with counts — you can’t have a negative number of UFO sightings;
the Gaussian is symmetrical around its mean value whereas count data is typically skewed.
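The three objections above can be seen directly by drawing from each kind of distribution. The sketch below is a toy comparison in base R with arbitrary parameter values and an arbitrary seed; it uses a negative binomial as one example of a count distribution and is not a model of the sightings data.

```r
# Compare draws from a Gaussian with draws from a count distribution.
# Parameter values are arbitrary and purely illustrative.
set.seed(37)

gaussian_draws <- rnorm(1000, mean = 2, sd = 3)
count_draws    <- rnbinom(1000, mu = 2, size = 1)

# The Gaussian yields fractional and negative "counts"
print(head(round(gaussian_draws, 2)))
print(any(gaussian_draws < 0))

# The count distribution yields only non-negative whole numbers
print(head(count_draws))
print(all(count_draws >= 0 & count_draws == round(count_draws)))

# Skew: the mean of the counts is dragged above the median by a long right tail
print(c(mean = mean(count_draws), median = median(count_draws)))
```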

Moving from the safety and comfort of basic linear regression, then, we will delve into the madness and chaos of generalized linear models that allow us to choose from a range of distributions to describe the relationship between state population and counts of UFO sightings.

Basic Models

We will be working in a Bayesian framework, in which we assign a prior distribution to each parameter that allows, and requires, us to express some prior knowledge about the parameters of interest. These priors are the starting points for parameters, from which the model moves towards the underlying values as it learns from the data. Choice of priors can have significant effects not only on the outputs of the model, but also on its ability to function effectively; as such, it is an important, but also arcane and subtle, aspect of the Bayesian approach¹².

Practically speaking, a simple linear regression can be expressed in the following form:
$$y \sim \mathcal{N}(\mu, \sigma)$$

(Read as “$y$ is drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$”).

In the above expression the model relies on a Gaussian, or normal, likelihood ($\mathcal{N}$) to describe the data, making assertions regarding how we believe the underlying data was generated. The Gaussian distribution is parameterised by a location parameter ($\mu$) and a standard deviation ($\sigma$).

If we were uninterested in prediction, we could describe the shape of the distribution of counts ($y$) without a predictor variable. In this approach, we could specify our model by providing priors for $\mu$ and $\sigma$ that express a level of belief in their likely values:

$$\begin{eqnarray}
y &\sim& \mathcal{N}(\mu, \sigma) \\
\mu &\sim& \mathcal{N}(0, 1) \\
\sigma &\sim& \mathbf{HalfCauchy}(2)
\end{eqnarray}$$

This provides an initial belief as to the likely shape of the data that informs, via arcane computational procedures, the model of how the observed data approaches the underlying truth¹³.
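
One way to see what such priors imply, before any data is involved, is to simulate from them. The following is a minimal sketch in base R, assuming the priors for the predictor-free model above; the half-Cauchy draw is taken as the absolute value of a Cauchy draw, and the number of draws is arbitrary.

[code language="r"]
set.seed(13)
n_draws <- 2000

# Draw parameters from their priors
mu    <- rnorm(n_draws, mean = 0, sd = 1)
sigma <- abs(rcauchy(n_draws, location = 0, scale = 2))  # half-Cauchy(2)

# Draw one simulated observation per prior draw
y_prior <- rnorm(n_draws, mean = mu, sd = sigma)

# The heavy Cauchy tail allows occasional extreme values, so summarise
# with quantiles rather than the mean and variance
print(quantile(y_prior, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)))
[/code]

If the simulated values sit wildly outside the plausible range of the response, that is a hint that the priors, or the scaling of the data, deserve a second look.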

This model is less than interesting, however. It simply defines a range of possible Gaussian distributions without unveiling the horror of the underlying relationships between unsuspecting terrestrial inhabitants and anomalous events.

To construct such a model, relating a predictor to a response, we express those relationships as follows:

$$\begin{eqnarray}
y &\sim& \mathcal{N}(\mu, \sigma) \\
\mu &=& \alpha + \beta x \\
\alpha &\sim& \mathcal{N}(0, 1) \\
\beta &\sim& \mathcal{N}(0, 1) \\
\sigma &\sim& \mathbf{HalfCauchy}(1)
\end{eqnarray}$$

In this model, the parameters of the likelihood are now probability distributions themselves. In place of the single mean of the earlier model, we now have an intercept ($\alpha$) and a slope ($\beta$) that relates the change in the predictor variable ($x$) to the change in the response, as in a traditional linear model. Each of these parameters, along with $\sigma$, is fitted according to the observed dataset.

A New Model

We can now break free from the bonds of pure linear regression and consider other distributions that more naturally describe data of the form that we are considering. The awful power of GLMs is that they can use an underlying linear model, such as $\alpha + \beta x$, as parameters to a range of likelihoods beyond the Gaussian. This allows the natural description of a vast and esoteric menagerie of possible data.

The second key element of a generalised linear model is the link function that transforms the relationship between the parameters and the data into a form suitable for our twisted calculations. We can consider the link function as acting on the linear predictor — such as $\alpha + \beta x$ in our example model — to represent a different relationship via a range of possible functions, many of which are inextricably bound to certain likelihood functions.

For count data the most commonly-chosen likelihood is the Poisson distribution, whose sole parameter is the arrival rate ($\lambda$). While the Poisson is somewhat restricted, as we will see, we can begin our descent into madness by fitting a Poisson-based model to our observed data. For Poisson-based generalised linear models, the canonical link function is the log: our linear predictor, rather than directly being the parameter $\lambda$, is instead the logarithm of $\lambda$. The insidious effects of this on the output of the model will become all too obvious as we persist.
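
The effect of the log link can be sketched with base R's glm(), which fits the same form of model by maximum likelihood rather than by the Bayesian machinery used in the rest of this post. The data are simulated, and the intercept and slope values are hypothetical choices for illustration only.

[code language="r"]
set.seed(37)
n <- 500
x <- rnorm(n)            # standardised predictor
alpha_true <- 1.0        # hypothetical intercept
beta_true  <- 0.5        # hypothetical slope

# The linear predictor is log(lambda); exponentiate to recover the rate
lambda <- exp(alpha_true + beta_true * x)
y <- rpois(n, lambda)

# Maximum-likelihood Poisson GLM with the canonical log link
fit <- glm(y ~ x, family = poisson(link = "log"))
print(coef(fit))

# On the log scale the effect of x is additive; on the count scale it is
# multiplicative: a unit increase in x multiplies the expected count by
# exp(beta)
rate_ratio <- exp(coef(fit)[["x"]])
print(rate_ratio)
[/code]

Because the rate is the exponential of the linear predictor, it can never be negative, which is exactly the property that the Gaussian model lacked.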

Stan

To fit a model, we will use the Stan probabilistic programming language. Stan allows us to write a program defining a statistical model which can then be fit to the data using Markov Chain Monte Carlo (MCMC) methods. In effect, at a very abstract level, this approach uses random sampling to discover the values of the parameters that best fit the observed data¹⁴.

Stan lets us specify models in the form given above, along with ways to pass in and define the nature and form of the data. This code can then be called from R using the rstan package.

In this, and subsequent posts, we will be using Stan code directly as both a learning and explanatory exercise. In typical usage, however, it is often more convenient to use one of two excellent R packages, brms or rstanarm, that allow for more compact and convenient specification of models, with well-specified raw Stan code generated automatically.

De Profundis

As we take our first steps beyond the placid island of ignorance of the Gaussian, the Poisson distribution is the natural starting point for assessing count data. Adapting the Gaussian model above, we can propose a predictive model for the entire population of states as follows:

$$\begin{eqnarray}
y &\sim& \mathbf{Poisson}(\lambda) \\
\log( \lambda ) &=& \alpha + \beta x \\
\alpha &\sim& \mathcal{N}(0, 1) \\
\beta &\sim& \mathcal{N}(0, 1)
\end{eqnarray}$$

The sole parameter of the Poisson is the arrival rate ($\lambda$) that we construct here from a population-wide intercept ($\alpha$) and slope ($\beta$). Note that, in contrast to earlier models, the linear predictor is subject to the $\log$ link function.

The Stan code for the above model, and associated R code to run it, is below:

Show model specification and execution code.

population_model_poisson.stan
[code language=”c”]

data {

// Number of rows (observations)
int observations;

// Predictor (population of state)
vector[ observations ] population_raw;

// Response (counts)
int counts[observations];

}

transformed data {

// Center and scale the predictor
vector[ observations ] population;
population = ( population_raw - mean( population_raw ) ) / sd( population_raw );

}

parameters {

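// Intercept
// (lower=0 restricts the normal( 0, 1 ) prior below to its positive half)
real< lower=0 > a;

// Slope
// (likewise constrained to be non-negative)
real< lower=0 > b;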

}

transformed parameters {

vector[ observations ] mu;
mu = a + b * population;

}

model {

// Priors
a ~ normal( 0, 1 );
b ~ normal( 0, 1 );

// Model using the log-parameterised poisson
counts ~ poisson_log( mu );

}

generated quantities {

// Posterior predictions
vector[observations] counts_pred;

// Log likelihood (for LOO)
vector[observations] log_lik;

for (n in 1:observations) {

log_lik[n] = poisson_log_lpmf( counts[n] | mu[n] );
counts_pred[n] = poisson_log_rng( mu[n] );

}

}

[/code]

population_model_poisson.r
[code language=”r”] library( tidyverse )
library( magrittr )

library( ggplot2 )
library( showtext )

library( rstan )
library( tidybayes )
library( loo )

# Load UFO data
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

# Fit model of UFO sightings (Poisson)
# As this is computationally expensive, the fitted model will be
# saved to disk, and the process only run if the saved model file
# does not already exist.
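if( not( file.exists( "work/fit_ufo_pop_poisson.rds" ) ) ) {

message("Fitting basic Poisson model.")
# sms_notify() is a custom notification helper that is not defined in this
# listing; uncomment only if you supply your own.
# sms_notify("Fitting basic Poisson model.")

fit_ufo_pop_poisson <-
stan( file="model/population_model_poisson.stan",
data=list(
observations = nrow( ufo_population_sightings ),
# Name must match the population_raw declaration in the Stan data block
population_raw = extract2( ufo_population_sightings, "population" ),
counts = extract2( ufo_population_sightings, "count" )
)
)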

saveRDS( fit_ufo_pop_poisson, "work/fit_ufo_pop_poisson.rds" )

message("Basic Poisson model fitted.")

} else {

fit_ufo_pop_poisson <- readRDS( "work/fit_ufo_pop_poisson.rds" )

}
[/code]

With this model encoded and fit, we can now peel back the layers of the procedure to see the extent to which it has endured the horror of our data.

The MCMC algorithm that underpins Stan, specifically Hamiltonian Monte Carlo (HMC) using the No-U-Turn Sampler (NUTS), attempts to find an island of stability in the space of possibilities that corresponds to the best fit to the observed data. To do so, the algorithm spawns a set of Markov chains that explore the parameter space. If the model is appropriate, and the data coherent, the chains converge on the same small region of possible states.
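
The behaviour of multiple chains can be sketched with a far cruder sampler than the one Stan uses. The code below is a random-walk Metropolis sampler, not HMC or NUTS, applied to a Poisson regression on simulated data; the data, starting points, step size, and iteration counts are all assumptions chosen for illustration.

[code language="r"]
set.seed(1928)

# Toy data: hypothetical, simulated from known parameter values
x <- rnorm(200)
y <- rpois(200, exp(1 + 0.5 * x))

# Unnormalised log posterior: Poisson likelihood with log link and
# normal(0, 1) priors on the intercept a and slope b
log_post <- function(theta) {
  eta <- theta[1] + theta[2] * x
  sum(dpois(y, lambda = exp(eta), log = TRUE)) +
    sum(dnorm(theta, mean = 0, sd = 1, log = TRUE))
}

# One random-walk Metropolis chain (a much simpler relative of HMC/NUTS)
run_chain <- function(init, iterations = 4000, step = 0.05) {
  draws <- matrix(NA_real_, nrow = iterations, ncol = 2,
                  dimnames = list(NULL, c("a", "b")))
  current <- init
  current_lp <- log_post(current)
  for (i in seq_len(iterations)) {
    proposal <- current + rnorm(2, sd = step)
    proposal_lp <- log_post(proposal)
    # Accept with probability min(1, posterior ratio)
    if (log(runif(1)) < proposal_lp - current_lp) {
      current <- proposal
      current_lp <- proposal_lp
    }
    draws[i, ] <- current
  }
  draws
}

# Four chains from deliberately scattered starting points
inits <- list(c(-2, -2), c(2, 2), c(-2, 2), c(2, -2))
chains <- lapply(inits, run_chain)

# Discard the first half of each chain as warm-up and compare the means
post_warmup <- lapply(chains, function(d) d[2001:4000, ])
chain_means <- t(sapply(post_warmup, colMeans))
print(round(chain_means, 2))
[/code]

Despite their scattered origins, the four chains should report very similar means for each parameter once the warm-up draws are discarded. Stan's sampler reaches the same kind of agreement far more efficiently by using the gradient of the log posterior to guide its proposals.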

Validation

When modelling via this approach, a first check of the model’s chances of having fit correctly is to examine the so-called ‘traceplot’ that shows how well the separate Markov chains ‘mix’, that is, converge to exploring the same area of the parameter space¹⁵. For the Poisson model above, the traceplot can be created using rstan's traceplot() function:

Traceplot of Markov chains from Poisson model fitting.

Traceplot of Markov chains from Poisson model fitting. (PDF Version)

Show traceplot code.

[code language=”r”] library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )

# Load UFO data
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
font_add( "bold_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Bayesplot needs to be told which theme to use as a default.
theme_set( theme_weird() )

# Read the fitted model
fit_ufo_pop_poisson <- readRDS( "work/fit_ufo_pop_poisson.rds" )

# First, as always, a traceplot
tp <-
traceplot(
fit_ufo_pop_poisson,
pars = c("a", "b"),
ncol=1 ) +
scale_colour_viridis_d( name="Chain", direction=-1 ) +
theme_weird()

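title <-
ggdraw() +
draw_label("Traceplot of Key Model Parameters", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

# Combine title and traceplot. (This step was missing from the listing; it
# follows the backdrop pattern used for the plots in Part One.)
titled_tp <-
plot_grid(title, tp, ncol=1, rel_heights=c(0.1, 1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222")
)

save_plot("output/poisson_traceplot.pdf",
titled_tp,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )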
[/code]

These traceplots exhibit the characteristic insane scribbling of well-mixed chains often referred to, in hushed whispers, as weirdly reminiscent of a hairy caterpillar; the separate lines representing each chain are clearly overlapping and exploring the same forbidding regions. If, by contrast, the lines were largely separated or did not explore the same space, there would be reason to believe that our model had become lost and unable to find a coherent voice amongst the myriad babbling murmurs of the data.

A second check on the sanity of the modelling process is to examine the output of the model itself to show the value of the fitted parameters of interest, and some diagnostic information:

fit_ufo_pop_poisson %>%
summary(pars=c("a", "b" )) %>%
extract2( "summary" )
       mean      se_mean          sd      2.5%       25%       50%      75%     97.5%    n_eff      Rhat
a 4.0236045 1.026568e-04 0.004851688 4.0139485 4.0203329 4.0236485 4.026829 4.0330836 2233.626 0.9995597
b 0.5070227 6.206903e-05 0.002263160 0.5027733 0.5054245 0.5069979 0.508547 0.5115027 1329.477 1.0021745

For assessment of successful model fit, the Rhat ($\hat{R}$) value represents the extent to which the various Markov chains exploring the parameter space, of which there are four by default in Stan, are consistent with each other. As a rule of thumb, a value of $\hat{R} \gt 1.1$ indicates that the model has not converged appropriately and may require a longer set of random sampling iterations, or an improved model. Here, the values of $\hat{R}$ are close to the ideal value of 1.
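
The idea behind $\hat{R}$ can be sketched in a few lines of base R. This is the basic, non-split form of the statistic, comparing between-chain and within-chain variance; Stan's own diagnostic is a refinement of it, so the values will not match Stan's output exactly. The two sets of chains are simulated for illustration.

[code language="r"]
# chains: an iterations x n_chains matrix of draws for ONE parameter
rhat <- function(chains) {
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  chain_vars  <- apply(chains, 2, var)
  B <- n * var(chain_means)              # between-chain variance
  W <- mean(chain_vars)                  # within-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled estimate of posterior variance
  sqrt(var_plus / W)
}

set.seed(4)
# Four well-mixed chains: all sampling the same distribution
mixed <- matrix(rnorm(4000, mean = 4, sd = 0.1), ncol = 4)
# Four chains stuck in different regions of parameter space
stuck <- sweep(mixed, 2, c(0, 0.5, 1, 1.5), "+")

print(rhat(mixed))
print(rhat(stuck))
[/code]

When the chains agree, the between-chain variance adds almost nothing and the ratio sits close to 1; when they occupy different regions, the pooled variance dwarfs the within-chain variance and the statistic climbs well past the 1.1 threshold.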

As a final step, we should examine how well our model can reproduce the shape of the original data. Models aim to be eerily lifelike parodies of the truth; in a Bayesian framework, and in the Stan language, we can build into the model the ability to draw random samples from the posterior predictive distribution, the distribution of new outcomes implied by the parameters that the model has learnt from the data, to create new possible values of the outcomes based on the observed inputs. This process can be repeated many times to produce a multiplicity of possible outcomes drawn from the model, which we can then visualize to see graphically how well our model fits the observed data.

In the Stan code above, these draws are produced in the generated quantities block. When using more convenient libraries such as brms or rstanarm, draws from the posterior predictive distribution can be obtained more simply after the model has been fit through a range of helper functions. Here, we undertake the process manually.
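
As a rough sketch of what that block computes, the same draws can be imitated in base R. The populations below are hypothetical, and the posterior draws are stand-ins generated around the fitted means reported earlier, treated as independent for simplicity; real draws would be extracted from the fitted model.

[code language="r"]
set.seed(2019)

# Stand-in inputs (hypothetical): standardised populations and
# approximate posterior draws for the intercept and slope
population <- as.numeric(scale(c(500, 1200, 3000, 5500, 9000, 20000, 35000)))
n_draws <- 1000
a_draws <- rnorm(n_draws, mean = 4.02, sd = 0.005)
b_draws <- rnorm(n_draws, mean = 0.507, sd = 0.002)

# For each posterior draw, compute the linear predictor for every
# observation, then simulate a count from the implied Poisson
log_rate <- outer(a_draws, rep(1, length(population))) +
  outer(b_draws, population)
counts_pred <- matrix(
  rpois(length(log_rate), lambda = exp(log_rate)),
  nrow = n_draws
)

# Rows are posterior draws and columns are observations, the same layout
# as as.matrix( fit, pars = "counts_pred" ) in the plotting code
print(dim(counts_pred))
print(round(colMeans(counts_pred), 1))
[/code]

Each row of the resulting matrix is one simulated dataset, a single plausible world according to the model, and it is a sample of such rows that is overlaid on the observed data in the density plot.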

We can see, then, how well the Poisson distribution, informed by our selection of priors, has shaped itself to the underlying data.

Posterior predictive density plot of fitted Poisson model.

Posterior predictive density plot of fitted Poisson model. (PDF Version)

Show posterior predictive plot code.

[code language=”r”] library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )

# Load UFO data
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
font_add( "bold_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Plots, posterior predictive checking, LOO.

# Bayesplot needs to be told which theme to use as a default.
theme_set( theme_weird() )

# Read the fitted model
fit_ufo_pop_poisson <- readRDS( "work/fit_ufo_pop_poisson.rds" )

## Model checking visualisations

# Extract posterior estimates from the fit (from the generated quantities of the stan model)
counts_pred_poisson <- as.matrix( fit_ufo_pop_poisson, pars = "counts_pred" )

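# Posterior predictive density. (Visual representation of goodness of fit.)
# Sample 50 rows for overlay
counts_pred_sample <-
counts_pred_poisson[ sample( nrow( counts_pred_poisson ), 50 ), ]

gp_ppc <-
ppc_dens_overlay(
y = extract2( ufo_population_sightings, "count" ),
yrep = counts_pred_sample,
alpha=0.4) +
theme_weird()

# The listing omits the step that builds titled_gp_ppc. This reconstruction
# follows the title-and-backdrop pattern used elsewhere in this series;
# the title text is an assumption.
title <-
ggdraw() +
draw_label("Posterior Predictive Density of Poisson Model", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

titled_gp_ppc <-
plot_grid(title, gp_ppc, ncol=1, rel_heights=c(0.1, 1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222")
)

save_plot("output/poisson_posterior_predictive.pdf",
titled_gp_ppc,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )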
[/code]

In the diagram above, the yellow line shows the densities of count values; the cyan lines show a sample of twisted mockeries spawned by our piscine approximations. The model has roughly captured the shape of the distribution of the original data, but demonstrates certain hideous dissimilarities — the peak of the posterior predictive distribution is significantly skewed away from the observed value.

To appreciate the full horror of what we have wrought we can plot the predictions of the model against the real data.

Global poisson GLM of UFO sightings against population.

Global poisson GLM of UFO sightings against population. (PDF Version)

Show posterior predictive plot code.

[code language=”r”]

library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )
library( modelr )

# Load UFO data and model
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

fit_ufo_pop_poisson <-
readRDS("work/fit_ufo_pop_poisson.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
font_add( "bold_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Plots, posterior predictive checking, LOO
theme_set( theme_weird() )

# Use teal colour scheme
color_scheme_set( "teal")

## Model checking visualisations

# Extract posterior estimates from the fit (from the generated quantities of the stan model)
counts_pred_poisson <- as.matrix( fit_ufo_pop_poisson, pars = "counts_pred" )

# US state data
us_state_factors <-
levels( factor( ufo_population_sightings$state ) )

## Create per-state predictive fit plots

# Convert fitted model (stanfit) object to a tibble
fit_tbl <-
summary(fit_ufo_pop_poisson)$summary %>%
as.data.frame() %>%
mutate(variable = rownames(.)) %>%
select(variable, everything()) %>%
as_tibble()

counts_predicted <-
fit_tbl %>%
filter( str_detect(variable, "counts_pred") )

ufo_population_sightings_pred <-
ufo_population_sightings %>%
ungroup() %>%
mutate( count_mean = counts_predicted$mean,
lower = counts_predicted$`2.5%`,
upper = counts_predicted$`97.5%`)

# (Using the mean and the 2.5% and 97.5% quantiles of the fit summary)
predictive_plot <-
ggplot( ufo_population_sightings_pred ) +
geom_point( aes( x=population, y=count ), colour="#0b6788", size=0.6, alpha=0.8 ) +
geom_line(aes( x=population, y=count_mean ), colour="#3cd070" ) +
geom_ribbon(aes(x=population, ymin = lower, ymax = upper ), alpha = 0.2, fill="#3cd070") +
labs( x="Population (Thousands)", y="Annual Sightings" ) +
scale_fill_viridis_d( name="State" ) +
scale_colour_viridis_d( name="State" ) +
theme(
axis.title.y = element_text( angle=90 ),
legend.position = "none" )

# Construct full plot, with title and backdrop.
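title <-
ggdraw() +
draw_label("UFO Sightings against State Population (1990-2014)", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("Poisson GLM. 95% credible intervals.", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.48) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.16)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org | Tool: http://www.mc-stan.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

# Combine title, plot, and data label. (This step was missing from the
# listing; it follows the pattern used for the plots in Part One.)
predictive_plot_titled <-
plot_grid(title, predictive_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222")
)

save_plot("output/poisson_predictive_plot.pdf",
predictive_plot_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )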

[/code]

This shows a notably different line of best fit to that produced from the basic Gaussian model in the previous post. The most visible difference is the curved predictor resulting from the $\log$ link function, which appears to account for the changes in the data very differently to the constrained absolute linearity of the previous Gaussian model¹⁶. Whether this is more or less effective remains to be seen.
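
The curvature follows directly from exponentiating the linear predictor. As a plug-in sketch, using only the posterior means from the parameter summary reported earlier in this post and ignoring the (very small) posterior uncertainty, the fitted coefficients can be read on the count scale as follows. The range of standardised populations is an arbitrary choice for illustration.

[code language="r"]
# Posterior means taken from the parameter summary earlier in this post
a <- 4.0236045
b <- 0.5070227

# Expected annual sightings for a state-year at the mean population (z = 0)
rate_at_mean <- exp(a)

# Multiplicative change in expected sightings for each additional
# standard deviation of population
rate_ratio <- exp(b)

# Expected sightings across a range of standardised populations
z <- seq(-1, 3, by = 1)
expected <- exp(a + b * z)

print(round(c(rate_at_mean = rate_at_mean, rate_ratio = rate_ratio), 2))
print(round(setNames(expected, paste0("z=", z)), 1))
[/code]

Under this model a state-year at the mean population expects roughly 56 sightings, and each further standard deviation of population multiplies that expectation by about 1.66 rather than adding a fixed number, which is what bends the fitted line upwards.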

Unsettling Distributions

In this post we have opened our eyes to the weirdly non-linear possibilities of generalised linear models; sealed and bound this concept within the wild philosophy of Bayesian inference; and unleashed the horrifying capacities of Markov Chain Monte Carlo methods and their manifestation in the Stan language.

Applying the Poisson distribution to our records of extraterrestrial sightings, we have seen that we can, to some extent, create a mindless Golem that imperfectly mimics the original data. In the next post, we will delve more deeply into the esoteric possibilities of other distributions for count data, explore ways in which to account for arcane relationships across and between per-state observations, and show how we can compare the effectiveness of different models to select the final glimpse of dread truth that we inadvisably seek.

Footnotes

Bayes vs. the Invaders! Part One: The 37th Parallel

moth — Wed, 03 Apr 2019 13:03:10 +0000

Introduction

From our earlier studies of UFO sightings, a recurring question has been the extent to which the frequency of sightings of inexplicable otherworldly phenomena depends on the population of an area. Intuitively: where there are more people to catch a glimpse of the unknown, there will be more reports of alien visitors.

Is this hypothesis, however, true? Do UFO sightings closely follow population or are there other, less comforting, factors at work?

In this short series of posts, we will build a statistical model of UFO sightings in the United States, based on data previously scraped from the National UFO Reporting Center (NUFORC), and see how well we can predict the rate of UFO sightings based on state population.

This series of posts is part tutorial and part exploration of a set of modelling tools and techniques. Specifically, we will use Generalized Linear Models (GLMs), Bayesian inference, and the Stan probabilistic programming language to unveil the relationship between unsuspecting populations of US states and the dread sightings of extraterrestrial truth that they experience.

Data

As mentioned, we will rely on data from NUFORC for extraterrestrial sightings.

For population data, we can rely on the FRED database for historical US state-level census data. The combination of these datasets provides us with a count of UFO sightings per year for each state, and the population of that state in that year.

The downloading and scraping code is included here:

Show scraping code.

ZSH script to download via `curl`
[code language="bash"]
#!/bin/zsh
# Download US state-level population datasets from FRED
# State series names are stored in the file 'series_names' (downloaded from fred.stlouisfed.org)
#
# The per-series request is included below.

export IFS=$'\n'

# Download
for state_series in $(cat series_names); do

curl -o "output/$state_series.csv" "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=$state_series&scale=left&cosd=1900-01-01&coed=2018-01-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Annual&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2019-03-04&revision_date=2019-03-04&nd=1900-01-01"

done
[/code]

Necessary ‘series_names’ file:
[code language=”text”] WAPOP
GAPOP
CAPOP
MOPOP
DSPOP
ILPOP
TXPOP
NYPOP
FLPOP
ALPOP
COPOP
WIPOP
AZPOP
MIPOP
NCPOP
MAPOP
CTPOP
LAPOP
OHPOP
AKPOP
TNPOP
MNPOP
NJPOP
NMPOP
ARPOP
MDPOP
PAPOP
NVPOP
IAPOP
ORPOP
T5POP
DCPOP
HIPOP
NDPOP
KYPOP
VAPOP
IDPOP
KSPOP
INPOP
WVPOP
RIPOP
SCPOP
MSPOP
DEPOP
MTPOP
MEPOP
NEPOP
OKPOP
WYPOP
UTPOP
NHPOP
VTPOP
SDPOP
[/code]

R code to combine data into tidy format
[code language=”r”] library( tidyverse )

# Read all CSV files
census_files <- list.files( "output", full.names=TRUE )

# Join all data into a single table
census_data <-
census_files %>%
map( read_csv ) %>% # Read each file, forming a list with an element for each
reduce( full_join, by="DATE" ) %>% # Reduce (left to right) running a full join
dplyr::arrange( DATE ) %>% # Sort by date
gather( key="state", value="population", -DATE ) %>% # Gather to long format
transmute( date=DATE, state=str_replace( state, "POP", "" ), population ) # Rename and tidy variables and names

# Output to an .rds
saveRDS( census_data, "data/annual_population.rds" )

[/code]

For ease, we will treat each year’s count of sightings as independent from the previous year’s; we do not assume that the number of sightings in each year is based on the number of sightings in the previous year, but rather that it is due to the unknowable schemes of alien minds. (If extraterrestrial visitors were colonising areas in secrecy rather than making sporadic visits, and thus being seen repeatedly, we might not want to make such a bold assumption.) Each annual count will be treated as an individual, independent data point relating population to count, with each observation tagged by state.

For simplicity, particularly in building later models, we will restrict ourselves to sightings post 1990, roughly reflecting a period in which the NUFORC data sees a significant increase in reporting and thus relies less on historical reports. (NUFORC’s phone hotline has existed since 1974, and its web form since 1998.)

An Awful Simplicity

To begin, we start with the most basic form of model: a simple linear relationship between the count of sightings and the population of the state at that time. If sightings were purely dependent on population, it might be reasonable to assume that such a model would fit the data fairly well.

This relationship can be plotted with relative ease using the geom_smooth() function of ggplot2 in R. For opening our eyes to the awful truth contained in the data, this is a useful first step.

Global linear regression of UFO sightings against population.

Global linear regression of UFO sightings against population. (PDF Version)

While this graph does seem to support the argument that sightings increase with population in general, a closer inspection shows that the individual data points are clearly clustered. If we highlight the location of each data point, colouring points by US state, this becomes clearer:

Global linear regression of UFO sightings against population with per-state colours.

Global linear regression of UFO sightings against population with per-state colours. (PDF Version)

This strongly suggests that, in preference to the simple linear model across all sightings, we might instead fit a linear model individually to each state:

Per-state linear regression of UFO sightings against population.

Per-state linear regression of UFO sightings against population. (PDF Version)

The code to produce the above graphs from the NUFORC and FRED data is given below:

Show data preparation and visualization code.

Prepare and combine datasets:
[code language=”r”] library( tidyverse )
library( magrittr )
library( lubridate )

# Prepare data for model fitting (and plotting)

# Load US population and UFO datasets
ufo <- read_csv( "data/ufo_spatial.csv" )
census <- readRDS( "data/annual_population.rds" )

# Process UFO data to per-state counts per year.
# Drop Puerto Rico as we don’t have census data. (Also, very few sightings — 33 in dataset.)
ufo_state_annual <-
ufo %>%
# US only
filter( country == "us" ) %>%
# Apologies to Puerto Rico.
filter( state != "pr" ) %>%
# Convert date to year, drop all other variables except state.
transmute( date = year( as.POSIXct( datetime, format="%m/%d/%Y %H:%M" ) ), state=str_to_upper( state ) ) %>%
# Group by year
group_by( date, state ) %>%
# Sum sightings
summarize( count = n() )

# Process census suitable for joining with UFO sightings.
# Drop "DS" state entry — ("Department of State"?)
census <-
census %>%
filter( state != "DS" ) %>%
mutate( date=year( date ) )

# Join datasets
ufo_population_sightings <-
full_join( ufo_state_annual, census )

# Missing data implies zero sightings.
# Restrict to post-1990 to avoid a high proportion of very small numbers of
# sightings.
ufo_population_sightings <-
ufo_population_sightings %>%
mutate( count = replace_na( count, 0 ) ) %>%
filter( !is.na( population ) ) %>%
filter( date >= 1990 ) %>%
filter( date <= 2014 )

saveRDS( ufo_population_sightings, "work/ufo_population_sightings.rds" )
[/code]

Fit linear trend in data via geom_smooth() using a linear model.
[code language=”r”] library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( RColorBrewer )

library( cowplot )

# Load UFO data
ufo_population_sightings <-
readRDS("work/ufo_population_sightings.rds")

# UFO reporting font
font_add( "main_font", "/usr/share/fonts/TTF/weird/Tox Typewriter.ttf")
showtext_auto()

# Combined plot ignoring states.
ufo_pop_plot <-
ggplot( ufo_population_sightings, aes( x=population, y=count ) ) +
geom_point( colour="#0b6788", size=0.6, alpha=0.8 ) +
geom_smooth( method="lm", colour="#3cd070" ) + # UFO green
xlab( "Population" ) +
ylab( "Sightings per annum" ) +
theme_weird() +
theme(
axis.title.y = element_text( angle=90 )
)

# Construct full plot, with title and backdrop.
title <-
ggdraw() +
draw_label("UFO Sightings against State Population (1990-2014)", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

ufo_pop_titled <-
plot_grid(title, ufo_pop_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"),
)

save_plot("output/lm_ufo_population_sightings-combined.pdf",
ufo_pop_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

# Combined plot colouring states.
ufo_pop_plot_states <-
ggplot( ufo_population_sightings, aes( x=population, y=count ) ) +
geom_point( aes( colour=state ), size=0.6, alpha=0.8 ) +
geom_smooth( method="lm", colour="#3cd070" ) + # UFO green
xlab( "Population" ) +
ylab( "Sightings per annum" ) +
scale_colour_manual( values=rep( brewer.pal( name="Set3", n=12 ), times=5 ) ) +
theme_weird() +
theme(
axis.title.y = element_text( angle=90 ),
legend.position="none"
)

# Construct full plot, with title and backdrop.
title <-
ggdraw() +
draw_label("UFO Sightings against State Population (1990-2014)", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("(Per-state sightings)", fontfamily="main_font", colour = "#cccccc", size=16, hjust=0, vjust=1, x=0.02, y=0.48) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.16)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

ufo_pop_states_titled <-
plot_grid(title, ufo_pop_plot_states, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"),
)

save_plot("output/lm_ufo_population_sightings-state.pdf",
ufo_pop_states_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

# Combined plot colouring states with per-state trend lines.
ufo_pop_plot_states_trends <-
ggplot( ufo_population_sightings, aes( x=population, y=count ) ) +
geom_point( aes( colour=state ), size=0.6, alpha=0.8 ) +
geom_smooth( method="lm", aes( colour=state ) ) +
xlab( "Population" ) +
ylab( "Sightings Per Annum" ) +
scale_colour_manual( values=rep( brewer.pal( name="Set3", n=12 ), times=5 ) ) +
theme_weird() +
theme(
axis.title.y = element_text( angle=90 ),
legend.position="none"
)

# Construct full plot, with title and backdrop.
title <-
ggdraw() +
draw_label("UFO Sightings against State Population (1990-2014)", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("(Per-state trends)", fontfamily="main_font", colour = "#cccccc", size=16, hjust=0, vjust=1, x=0.02, y=0.48) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.16)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

ufo_pop_states_trends_titled <-
plot_grid(title, ufo_pop_plot_states_trends, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"),
)

save_plot("output/lm_ufo_population_sightings-trends.pdf",
ufo_pop_states_trends_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )

[/code]

Result

The plots shown here strongly indicate that the rate of dread interplanetary visitations per capita varies from state to state. It seems, therefore, that while the number of sightings is generally proportional to population, the specific relationship is state-dependent.
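
The same comparison can be made numerically rather than visually. The sketch below uses a small, entirely hypothetical data frame in the same shape as ufo_population_sightings (the state codes and values are invented) to fit a separate linear model per state and to compute a crude per-capita rate.

[code language="r"]
# Hypothetical toy data in the same shape as ufo_population_sightings
# (population in thousands, one row per state-year)
toy <- data.frame(
  state      = rep(c("AA", "BB", "CC"), each = 4),
  population = c(1000, 1050, 1100, 1150,
                 5000, 5100, 5200, 5300,
                 9000, 9300, 9600, 9900),
  count      = c(20, 23, 25, 28,
                 40, 41, 43, 44,
                 150, 165, 181, 195)
)

# Fit a separate linear model to each state and collect the slopes
per_state_slopes <- sapply(
  split(toy, toy$state),
  function(d) coef(lm(count ~ population, data = d))[["population"]]
)

# Crude per-capita rate: total sightings per thousand resident-years
totals <- aggregate(cbind(count, population) ~ state, data = toy, FUN = sum)
totals$per_thousand <- totals$count / totals$population

print(round(per_state_slopes, 4))
print(totals)
[/code]

Applied to the real data, differing slopes and per-capita rates of this kind are what the per-state trend lines in the final plot display.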

This simple linear model is, however, entirely unsatisfactory in describing the data, despite its support for the argument that different states have different underlying rates of sightings.

In the next post, therefore, we will delve deeper into the unsettling relationships between UFO sightings and the innocent humans to which they are drawn. To do so, we will have to consider a class of techniques that go beyond the normal distribution that underpins key assumptions of the simple linear models used here, and so move into the eldritch world of generalized linear models.