Never tell me the odds!
eoc/jw
February 13, 2018
Han Solo, the Bayesian Archaeologist?
About a week ago on the QUANTARCH FB page Jesse Wolfhagen and I posted R-code to accompany “Count Bayesie’s” excellent introductory example of Bayesian inference using Han Solo. This is a wonderful tool to introduce Bayesian thinking, in everyday language, to everyday people, and to your everyday Archaeologist and Biological Anthropologist.
Keeping the code and Count Bayesie’s awesome narrative apart, as we originally did, seemed unfortunate: read separately, the concepts are harder to follow or replicate on one’s own. To let interested readers follow along and replicate Bayesie’s analysis themselves, I am retelling Count Bayesie’s example (below in gray block quotes) and interlacing step-by-step R code to replicate it (eoc/jw code). But let’s be clear: credit for this example goes 100% to Will Kurt, a.k.a. Count Bayesie.
We also add some flavor of our own. Below the example, we use rjags and rstan to conduct a ‘proper’ Bayesian analysis via Markov chain Monte Carlo (MCMC) simulation. We hope readers find it helpful. If there are questions, please feel free to contact Jesse or me.
In summary, Count Bayesie uses a scene from Star Wars to introduce Bayesian thinking.
“One of the most memorable errors in statistical analysis is a scene from The Empire Strikes Back! Han Solo, attempting to evade enemy fighters, flies the Millennium Falcon into an asteroid field. The ever knowledgeable C3PO informs Solo that probability is not on his side:
C3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!
Here’s the scene for those who haven’t seen it or may have forgotten.
Superficially this is just a fun movie dismissing “boring” data analysis, but there’s actually an interesting dilemma here. Even the first time you watch Empire you know that Han can pull it off. But, despite deeply believing that Han will make it through, is C3PO’s analysis wrong? Clearly Han believes it’s dangerous, “They’d have to be crazy to follow us.” None of the pursuing tie fighters make it through, which provides pretty strong evidence that C3PO’s numbers are not off. So what are we missing?
What’s missing is that we know Han is a badass! C3PO isn’t wrong, he’s just forgetting to add essential information to his calculation. The question now is: can we find a way to avoid C3PO’s error without dismissing probability entirely as Han proposes? To answer this we’ll have to model how both C3PO thinks and what we believe about Han, then find a method to blend those models.
C3PO’s mind
We’ll start by taking apart C3PO’s reasoning.
We know C3PO well enough by this point in the film to realize that he’s not just making numbers up. C3PO is fluent in over 6 million forms of communication, and that takes a lot of data to support. We can assume then that he has actual data to back up his claim of “approximately 3,720 to 1”. Because C3PO says the odds of successfully navigating an asteroid field are approximately 3,720 to 1 against, we know that the data he has only gives him enough information to suggest a range of possible rates of success.
The only outcomes that C3PO is considering are successfully navigating the asteroid field or not. If we want to look at the various possible probabilities of success given the data C3PO has, the distribution we’re going to use is the Beta distribution. We can define C3PO’s reasoning with the following equation \[P(RateOfSuccess|Successes)=Beta(\alpha,\beta)\]
The Beta distribution is parameterized with an \(\alpha\) (number of times success observed, alpha) and a \(\beta\) (the number of times failure is observed, beta). This distribution tells us which rates of success are most likely given the data we have. We can’t really know what’s in C3PO’s head, but let’s assume that not too many people have actually made it through an asteroid field and in general not that many people try (because it’s crazy!). We’re going to say that C3PO has records of 2 people surviving and 7,440 people ending their trip through the asteroid field in a glorious explosion! Below is a plot of the Probability Density Function that represents C3PO’s belief in the true rate of success when entering an asteroid field.
# Original R code to replicate the analysis about Han
# C3PO's estimate (the likelihood)
not_han<-seq(0.0000,1, length.out=100000)
a_lik<-2
b_lik<-7440
lik_not_han<-dbeta(not_han,a_lik,b_lik)
plot(not_han,lik_not_han, type="l", xlim=c(0,0.005), xlab="Probability of Success", ylab="Density/Likelihood", main="C3PO's data backed beliefs, Beta(2,7440)")
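A quick sanity check of our own (not part of Count Bayesie’s original): odds of 3,720:1 against success imply a success probability of about 1/3721, which is essentially the mean of the Beta(2, 7440) we just plotted.
# Sanity check (our addition): C3PO's 3,720:1 odds against success imply
# a success probability of roughly 1/3721, matching the mean of Beta(2, 7440)
1 / 3721                 # ~0.000269
a_lik / (a_lik + b_lik)  # mean of a Beta(a, b) is a / (a + b), also ~0.000269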
For any ordinary pilot entering an asteroid field this looks bad. In Bayesian terms, C3PO’s estimate of the true rate of success given observed data is referred to as the Likelihood.
But Han is a badass
The problem with C3PO’s analysis is that his data is on all pilots, and Han is far from your average pilot. If we can’t put a number to Han’s “badass” then our analysis is broken, not just because Han makes it (we have p-values to blame for that), but because we believe he’s going to. Statistics is a tool to aid and organize our reasoning and beliefs about the world. If our statistical analysis not only contradicts our reasoning and beliefs but also fails to change them, then something is wrong with our analysis.
Why do we believe Han will make it? Because Han makes it through everything that’s happened so far! What makes Han Solo, Han Solo is that no matter how unlikely he is to make it through something he still succeeds! We have a prior belief that Han will survive the asteroid field. The Prior Probability is something that is very controversial for people outside of Bayesian analysis. Many people feel that just “making up” a prior is not objective. This scene from Empire is an object lesson in why it is even more absurd to throw out our prior beliefs. Imagine watching Empire the first time, getting to this scene and having a friend sincerely tell you that “whelp, Han is dead now”. It is worth pointing out again, C3PO is not entirely wrong. If your friend said “whelp, those Tie fighters are dead now”, you would likely chuckle in agreement.
Now we have to come up with an estimate for our Prior Probability that Han will successfully navigate the asteroid field. We do have a real problem though, we have a lot of reasons for believing Han will survive but no numbers to back that up. We have to make a guess. Let’s start with some sort of upper bound on his badassness. If we believe it was impossible for Han to die then the movie becomes boring. At the other end, I personally feel much more strongly about Han being able to make it than C3PO does about him failing. I’m going to say I roughly feel that Han has a 20,000:1 chance of making it through a situation like this. We’ll use another Beta distribution to express this for two reasons. First my beliefs are very approximate, so I’m okay with the true rate of survival being variable. Second, it makes calculations we need to do later much easier. Here is our distribution for our Prior Probability that Han will make it:
# Our prior knowledge about Han, because he's Han
han <- not_han # same grid of candidate success rates; only the Beta parameters differ
a_prior <- 20000
b_prior <- 1
prior_han <- dbeta(han, a_prior, b_prior)
plot(han,prior_han, col="red", type="l", xlim=c(.9990,1), xlab="Probability of Success", ylab="Density/Likelihood", main="Belief that Han will Succeed, Beta(20000,1)")
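The same kind of check, added by us, for the prior: a Beta(20000, 1) has mean 20000/20001, which is the roughly 20,000:1 belief in Han’s favor stated above.
# Check (our addition): the Beta(20000, 1) prior has mean 20000/20001,
# i.e., roughly 20,000:1 odds in Han's favor
a_prior / (a_prior + b_prior)  # ~0.99995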
Creating Suspense with a Posterior
We have now established what C3PO believes (Likelihood) and modeled our own beliefs in Han (Prior Probability), but we need a way to combine these. By combining beliefs we create what is called our Posterior Distribution. In this case the Posterior models our sense of suspense upon learning the Likelihood from C3PO. The purpose of C3PO’s analysis is in part to poke fun at his analytical thinking, but also to create a sense of real danger. Our Prior alone would leave us completely unconcerned for Han, but when we adjust it based on C3PO’s data we get a new belief in the real danger. The formula for the Posterior is actually very simple and intuitive: \[Posterior = Likelihood \cdot Prior\]
The only thing not explicitly stated in this formula is that we usually want to normalize everything so it sums up to 1. It also turns out that combining our two Beta distributions in this way, including the normalization, is remarkably easy.
\[Beta(\alpha_{posterior}, \beta_{posterior}) = Beta(\alpha_{likelihood} + \alpha_{prior}, \beta_{likelihood} + \beta_{prior})\]
And here is what our final, Posterior, belief looks like:
# The posterior probability that Han makes it out alive
a_post<-a_lik + a_prior
b_post<-b_lik + b_prior
post_han<-dbeta(han,a_post,b_post)
plot(han,post_han, col="red", type="l", xlab="Probability of Success", ylab="Posterior Density")
Let’s zoom in on the region between p = 0.7 and p = 0.76.
plot(han,post_han, col="red", type="l", xlim=c(.7,.76), xlab="Probability of Success", ylab="Posterior Density")
abline(v=c(0.72,0.74), col="blue", lty=2)
It looks like most of the probability mass sits slightly below 0.73, somewhere between 0.72 and 0.74.
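As a check of our own (not part of Count Bayesie’s original), the closed-form mean and central interval of the Beta(20002, 7441) posterior agree with that eyeballed range:
# Closed-form check (our addition): the mean of a Beta(a, b) is a / (a + b)
a_post / (a_post + b_post)              # ~0.729
qbeta(c(0.025, 0.975), a_post, b_post)  # central 95% interval, roughly 0.724 to 0.734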
Combining our C3PO belief with our Han-is-badass belief we find that we have a much less extreme position than either of these. Our Posterior belief is roughly a 73% chance of survival, which means we still think Han has a good shot of making it, but we’re much more nervous.
Conclusion
This post has been pretty light on math, but the real aim was to introduce the idea of Bayesian Priors and show that they are as rational as believing that Han Solo isn’t facing certain doom by entering the asteroid field. At the same time we can’t just throw away the information that C3PO has to share with us. The only sane way to understand the situation is to combine our belief in Han with the information C3PO has provided. This concept is a fundamental principle of Bayesian analysis.”
‘Proper’ Bayesians
We cannot beat Count Bayesie’s example, but we can add to it by showing how to conduct a more ‘proper’ Bayesian analysis, normalizing the posterior probabilities so they sum to 1. Count Bayesie’s example computed the posterior as \[Posterior = Likelihood \cdot Prior\] which is proportional to Bayes’ theorem: \[Posterior = \frac{Likelihood \cdot Prior}{P(Data)}\]
Below we provide code to compute Bayes’ theorem in full, standardizing the numerator (i.e., the product of the likelihood and the prior) by \(P(Data)\). I am going to use rjags, because it’s easy. Jesse Wolfhagen, of course, interjected and provided rstan code to do this much more efficiently. We have both tried to comment the code as thoroughly as possible for user replication. Please feel free to contact either one of us if there are questions.
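Before the MCMC versions, here is a minimal grid-approximation sketch of our own that makes the normalization explicit: multiply the likelihood by the prior over a grid of candidate success rates, then divide by a grid estimate of \(P(Data)\). We work on the log scale because the raw densities underflow to zero in double precision.
# Grid approximation of the full Bayes theorem (our addition, not Count Bayesie's code)
grid <- seq(0, 1, length.out = 100000)
log_unnorm <- dbeta(grid, 2, 7440, log = TRUE) +  # log Likelihood (C3PO's data-backed beliefs)
  dbeta(grid, 20000, 1, log = TRUE)               # log Prior (Han is a badass)
post <- exp(log_unnorm - max(log_unnorm))         # back to the raw scale without underflow
post <- post / sum(post)                          # normalize: divide by a grid estimate of P(Data)
sum(post * grid)                                  # posterior mean, roughly 0.73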
To install rjags, you will first need JAGS installed. Please download and install the appropriate version of JAGS for your operating system (Windows/Mac/Linux) from the JAGS project website.
library(rjags) # rjags must be installed first: install.packages("rjags", dependencies=TRUE)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
n <- 7440 # C-3PO knows that 7,440 asteroid field attempts have been made
obs <- 2 # C-3PO also knows that only two made it
a <- 20000 # we believe Han has a 20,000:1 chance of making it (pretty high!)
b <- 1 # we express this belief as a beta distribution
# first we write a text model for rjags
model_text <- "model{
# 3PO's likelihood, the probability that a random person (or storm trooper)
# will survive the asteroid field expressed as a binomial probability
Y ~ dbinom(theta,n)
# Our prior knowledge that Han has a high chance of surviving expressed as a beta prior probability
theta ~ dbeta(a, b)
}"
model <- jags.model(textConnection(model_text),
data = list(Y=obs,n=n,a=a,b=b),
n.chains = 5,
n.adapt = 100)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model
update(model, 10000, progress.bar="none") # burn-in: discard early, correlated samples
# sample from the model
sample <- coda.samples(model,
variable.names=c("theta"),
n.iter=50000, progress.bar="none")
# Let's summarize the results, point estimate and quantiles
# Remember, we are interested in the RANGE estimate of probability,
# more than we are interested in the point estimate
summary(sample)
##
## Iterations = 10101:60100
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 50000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 7.289e-01 2.686e-03 5.372e-06 6.643e-06
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.7236 0.7271 0.7289 0.7307 0.7342
# more specifically, what range lies between the 1st and 99th percentiles of the posterior?
quantile(unlist(sample),probs=c(0.01,0.99))
## 1% 99%
## 0.7226139 0.7351555
#plot
# Here we see the randomly sampled chains (left) and the posterior probability distribution (right)
plot(sample)
‘Proper’ and ‘modern’ Bayesians
Stan is state-of-the-art Bayesian software (hence the ‘modern’). This is Jesse Wolfhagen’s R and Stan code from QUANTARCH. I (EOC) made trivial adjustments, like adding the Stan model as a text object and pruning here and there, to make the code work in this medium.
library("ggplot2")
library("rstan")
## Loading required package: StanHeaders
## rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
library("StanHeaders")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# if you do not have the following packages, please install them, now.
#install.packages(c("devtools","mvtnorm","loo","coda"), repos="https://cloud.r-project.org/",dependencies=TRUE)
library(devtools)
#install_github("rmcelreath/rethinking")
library("rethinking")
## Loading required package: parallel
## rethinking (Version 1.59)
#Han Solo success
Han_data <- list(
rec_attempts = 7440, #C-3PO's recorded attempts
rec_success = 2, #C-3PO's recorded successes
Han_prior = c(20000, 1) #our prior belief on Han's skill
)
# Build the model for rstan to read in
hans_model<-"data {
int rec_attempts; //number of attempts C-3PO has recorded
int rec_success; //number of successes C-3PO has recorded
real Han_prior[2]; //confidence in Han's flying ability (X:Y odds - these are estimates of a beta distribution)
}
parameters {
real<lower = 0, upper = 1> theta; //probability of Han surviving an attempt
}
model {
theta ~ beta(Han_prior[1], Han_prior[2]); //based on our estimate of Han's skill
rec_success ~ binomial(rec_attempts, theta); //evidence of other attempts
}"
#run the Stan model
Hans_run <- stan(model_code = hans_model,
data = Han_data,
iter = 2000, chains = 4
, cores = 3, verbose = FALSE)
print(Hans_run) #print out the estimates
## Inference for Stan model: 8fb588b48b972555174cee00c8c37ce6.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75%
## theta 0.73 0.00 0.0 0.72 0.73 0.73 0.73
## lp__ -16035.39 0.02 0.7 -16037.41 -16035.53 -16035.13 -16034.95
## 97.5% n_eff Rhat
## theta 0.73 1697 1
## lp__ -16034.89 1859 1
##
## Samples were drawn using NUTS(diag_e) at Mon Feb 12 22:54:52 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# Let's see what the randomly sampled chains look like using traceplot()
traceplot(Hans_run)
# this plot will show the point estimate and the range estimate (credible interval)
plot(Hans_run)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
Hans_run_posterior <- extract(Hans_run)
HPDI(c(Hans_run_posterior$theta), prob = 0.99)
## |0.99 0.99|
## 0.7219106 0.7355001
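As a closed-form cross-check of our own: with a binomial likelihood and a Beta prior, the conjugate posterior is Beta(prior \(\alpha\) + successes, prior \(\beta\) + failures), and its central 99% interval should roughly bracket the HPDI above.
# Conjugate closed-form check (our addition): posterior is Beta(20000 + 2, 1 + 7440 - 2)
qbeta(c(0.005, 0.995), 20000 + 2, 1 + 7440 - 2)
# roughly 0.722 to 0.736, in line with the 99% HPDI from the Stan draws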