A couple of months back, I found this fantastic Medium article by Susan Li, which implemented a Bayesian model for estimating the effects of COVID-19 vaccines. As I had studied these topics as part of my course on Graphical models, I thought it would be interesting to re-implement this myself from scratch and break down the concepts. The article uses Bayesian inference and Markov-chain Monte-Carlo, which I'll discuss in detail. For an introduction to Bayesian thinking and Bayes' theorem, check out my post, Bayes Theorem – A primer.
This article is just a learning exercise and is not meant to make medical conclusions about the vaccines!!
Creating a Bayesian model for Vaccine efficacy
The premise of using a Bayesian model to study vaccine efficacy is as follows: let $\theta$ represent the probability that a vaccine X will prevent future infection. I have no clue what this value is, but let's not dwell on that for now; just know that it has some unknown value in [0, 1], and that value has a probability of occurring. Now the idea is that if I have a prior belief $P(\theta)$, then given the results of the recent clinical trials for that vaccine, which I call $D$, can I update this prior to get a posterior belief $P(\theta \mid D)$?
From this idea, let's define the Bayesian model skeleton as follows:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

Notice that the clinical trials data $D$ is always fixed, so the denominator $P(D)$ is a constant independent of $\theta$ and we can get rid of it. This gives the refined model:

$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$$
All I have to worry about now is defining the prior and the likelihood. First, the prior. Because I have no clue what $\theta$'s value is, I can assume all values are equally probable, noting these values have to be constrained between 0 and 1. Hence, I assume that $\theta$ is Beta(1, 1) distributed. This is essentially a uniform prior with support [0, 1], as shown below.
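To see why Beta(1, 1) is effectively uniform, here is a small sketch of mine (not part of the original article) that evaluates its density over [0, 1] with SciPy:

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import beta

# Beta(1, 1) has a constant density of 1 over its whole support [0, 1]
theta_grid = np.linspace(0, 1, 500)
prior = beta(a=1, b=1)

plt.plot(theta_grid, prior.pdf(theta_grid))
plt.xlabel(r"$\theta$")
plt.ylabel("Prior density")
plt.title("Beta(1, 1) prior is uniform on [0, 1]")
plt.show()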
Now for the likelihood $P(D \mid \theta)$. To understand this, we first need to understand the clinical trials data. The data describes, out of the confirmed COVID-19 cases in a trial, how many were in the vaccinated group and how many were in the control (placebo) group. For example, the Pfizer trial found that 170 participants were COVID-19 positive, of which eight were vaccinated and 162 were in the control group. The data for the other vaccines covered in the article is compiled the same way. Below I present the code to load the clinical trials data.
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import beta
import seaborn as sns

# =================================================
# compile data from multiple clinical trial results
# =================================================
n_Pfizer_total = 170
n_Pfizer_vaccine = 8
n_Pfizer_placebo = n_Pfizer_total - n_Pfizer_vaccine
Pfizer_outcomes = np.concatenate([np.zeros(n_Pfizer_vaccine), np.ones(n_Pfizer_placebo)])

n_Moderna_total = 95
n_Moderna_vaccine = 5
n_Moderna_placebo = n_Moderna_total - n_Moderna_vaccine
Moderna_outcomes = np.concatenate([np.zeros(n_Moderna_vaccine), np.ones(n_Moderna_placebo)])

n_AstraZeneca_1_total = 37
n_AstraZeneca_1_vaccine = 3
n_AstraZeneca_1_placebo = n_AstraZeneca_1_total - n_AstraZeneca_1_vaccine
AstraZeneca_1_outcomes = np.concatenate([np.zeros(n_AstraZeneca_1_vaccine), np.ones(n_AstraZeneca_1_placebo)])

n_AstraZeneca_2_total = 94
n_AstraZeneca_2_vaccine = 26
n_AstraZeneca_2_placebo = n_AstraZeneca_2_total - n_AstraZeneca_2_vaccine
AstraZeneca_2_outcomes = np.concatenate([np.zeros(n_AstraZeneca_2_vaccine), np.ones(n_AstraZeneca_2_placebo)])
The likelihood is Bernoulli distributed, where COVID-19 presenting patients in the vaccinated group are encoded as 0 and those in the control group as 1. Matching the code above, the probability of a patient being in the vaccinated group and presenting with COVID-19 later is $1 - \theta$, and the probability of being in the control group and presenting with COVID-19 later is $\theta$.

Below is the likelihood of a single data point $x_i$:

$$P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{1 - x_i}$$

Assuming the data points are independent and identically distributed (i.i.d.), we then say

$$P(D \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i}$$

To avoid numerical underflow, we can instead work with the log-likelihood:

$$\log P(D \mid \theta) = \sum_{i=1}^{N} \left[ x_i \log\theta + (1 - x_i) \log(1 - \theta) \right]$$

This is implemented in the log-likelihood function below.
def getlogLikelihood(data, p):
    """ Compute Bernoulli log-likelihood """
    loglik = np.sum(data*np.log(p) + (1 - data)*np.log(1 - p))
    return loglik
The final Bayesian model is as follows:

$$P(\theta \mid D) \propto \left[ \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i} \right] \cdot \mathrm{Beta}(\theta;\, 1, 1)$$
Finding the distribution of $\theta$ using Markov-chain Monte-Carlo
Now that we have a Bayesian model for studying vaccine efficacy, the next step is approximating the posterior distribution. Approximating the distribution is very helpful for estimating statistical properties like the mean and the variance, and for establishing credible intervals. To create the distribution, I used a straightforward and beautiful technique called Markov-chain Monte-Carlo (MCMC); specifically, a type of MCMC called the Metropolis-Hastings algorithm.
MCMC is based on a broader idea called Monte-Carlo sampling, which estimates the value of some numerical quantity by randomly sampling data points rather than using a complex deterministic formula. Do check out the video below to understand Monte-Carlo sampling – it shows how one can find the value of pi through a Monte-Carlo simulation.
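For reference, here is a minimal sketch of that pi estimate (my own illustration, not code from the article or the video): sample points uniformly in the unit square and count the fraction that lands inside the quarter circle of radius 1.

import numpy as np

# Monte-Carlo estimate of pi: the fraction of uniform random points in the
# unit square that fall inside the quarter circle approximates pi/4.
n_samples = 1_000_000
x = np.random.uniform(0, 1, n_samples)
y = np.random.uniform(0, 1, n_samples)
inside = (x**2 + y**2) <= 1.0
pi_estimate = 4 * inside.mean()
print(pi_estimate)  # approaches 3.14... as n_samples grows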
I’ll now explain how MCMC works. First, let’s assume that the vaccine efficacy has a true distribution that looks like this:
Suppose we have some initial estimate of the vaccine efficacy; let's call this estimate $\theta_0$, which can be any random number we pick between 0 and 1. We want to generate more samples from this starting point, and to do that, we can perturb it to create a new candidate point $\theta_c$. We add this perturbation by sampling a value from another probability distribution called the "proposal" distribution. A common choice for the proposal distribution is a Gaussian distribution with mean $\theta_0$ and standard deviation $\sigma$. This is shown in the picture below.
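In code, the proposal step is a single Gaussian draw. A tiny sketch of mine (sigma is just an illustrative tuning choice here; the scratch implementation further below uses a standard deviation of 1):

import numpy as np

sigma = 1.0                                  # proposal standard deviation (tuning choice)
theta_0 = np.random.uniform()                # initial estimate in [0, 1]
theta_c = np.random.normal(theta_0, sigma)   # candidate drawn around the current point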
Next, we need to decide whether this candidate point should be used as a new point in the sample or whether it should be rejected. The way we do this is by comparing the ratio of the posteriors of $\theta_c$ and $\theta_0$ under the Bayesian model. This is represented as follows:

$$r = \frac{P(\theta_c \mid D)}{P(\theta_0 \mid D)} = \frac{P(D \mid \theta_c)\, P(\theta_c)}{P(D \mid \theta_0)\, P(\theta_0)}$$
If the ratio of posteriors is greater than 1, then we directly accept this sample. If it is less than 1, then we generate a uniform random number between 0 and 1; if the ratio is greater than this number, we accept the candidate, else we reject it. Intuitively, a ratio larger than one means that the data is better described by the new parameter $\theta_c$ than by the old parameter $\theta_0$. This entire algorithm is run for many iterations to collect enough samples to give a smooth-looking distribution for the posterior.
Below is the code which implements the MCMC algorithm from scratch. There are other parameters for the algorithm, like the burn-in period and the sampling (thinning) interval; I've not used these, just to keep things simple (a small sketch of how they could be applied follows the code).
# =====================================================
# Implement functions for performing Bayesian inference
# =====================================================
def getlogLikelihood(data, p):
    """ Compute Bernoulli log-likelihood """
    loglik = np.sum(data*np.log(p) + (1 - data)*np.log(1 - p))
    return loglik

def acceptance_rule(x, x_new):
    """ Defines acceptance rule for selecting MCMC samples """
    if x_new > x:
        return True
    else:
        accept = np.random.uniform(0, 1)
        return (accept < np.exp(x_new - x))

def MCMC(iterations, param_init, data):
    """ Compute posterior distribution of vaccine efficacy using Markov-chain Monte-Carlo """
    x = param_init
    accepted = []
    # Enforce uniform Beta(1, 1) prior
    prior = beta(a=1, b=1)
    for i in range(iterations):
        # Draw a new candidate point from the Gaussian proposal distribution
        x_new = np.random.normal(x, 1)
        # Enforce that x_new stays within [0, 1]
        while x_new > 1 or x_new < 0:
            x_new = np.random.normal(x, 1)
        # Compute log-likelihoods of the old and new parameter estimates
        x_lik = getlogLikelihood(data, x)
        x_new_lik = getlogLikelihood(data, x_new)
        # Accept or reject the candidate based on the ratio of (log) posteriors
        if acceptance_rule(x_lik + np.log(prior.pdf(x)), x_new_lik + np.log(prior.pdf(x_new))):
            x = x_new
            accepted.append(x_new)
    return np.array(accepted)
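As noted above, burn-in and thinning are skipped in this implementation. For completeness, here is a minimal sketch of how they could be applied to the samples returned by MCMC (postprocess_chain, burn_in, and thin are illustrative names and values of mine, not tuned):

def postprocess_chain(samples, burn_in=1000, thin=10):
    """ Discard the first burn_in samples, then keep every thin-th sample """
    return samples[burn_in::thin]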
I now run the MCMC function for all the vaccines:
# ====================================================
# Estimate posterior distribution for vaccine efficacy
# ====================================================
nIter = 50000
param_init = np.random.uniform()

posterior_pfizer = MCMC(nIter, param_init, Pfizer_outcomes)
posterior_moderna = MCMC(nIter, param_init, Moderna_outcomes)
posterior_astrazeneca_1 = MCMC(nIter, param_init, AstraZeneca_1_outcomes)
posterior_astrazeneca_2 = MCMC(nIter, param_init, AstraZeneca_2_outcomes)

# ===========================
# Plot posterior distribution
# ===========================
sns.distplot(posterior_pfizer, hist=False, rug=True, label='Pfizer')
sns.distplot(posterior_moderna, hist=False, rug=True, label='Moderna')
sns.distplot(posterior_astrazeneca_1, hist=False, rug=True, label='AstraZeneca regimen 1')
sns.distplot(posterior_astrazeneca_2, hist=False, rug=True, label='AstraZeneca regimen 2')
plt.title("Posterior distributions of efficacy of different vaccines")
plt.legend()
plt.show()
The following posterior distributions are obtained, which are similar to those from the article's PyMC3 implementation.
With these posterior distributions, the article then performed some interesting comparisons between vaccines. By computing the difference between two posterior distributions, we can analyze the resulting distribution of differences to determine how significant the gap is. For example, the article found that "we have 99% confidence that (AstraZeneca) regimen 1 is more effective than (AstraZeneca) regimen 2." Comparing Moderna and Pfizer, on the other hand, the difference was not found to be significant.
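As an illustration, here is a small sketch of one way to do such a comparison (my own approach, not necessarily the article's exact method): difference randomly paired draws from the two chains, then look at how often the difference is positive. The chains can have different lengths because MCMC only returns accepted samples, so I pair up a random subset of each, matched to the shorter chain.

# Sketch: compare two posteriors (computed above) by differencing paired draws
n = min(len(posterior_astrazeneca_1), len(posterior_astrazeneca_2))
draws_1 = np.random.choice(posterior_astrazeneca_1, size=n, replace=False)
draws_2 = np.random.choice(posterior_astrazeneca_2, size=n, replace=False)
diff = draws_1 - draws_2

# Fraction of paired draws where regimen 1 comes out more effective than regimen 2
print(np.mean(diff > 0))

# Visualize the distribution of the difference
plt.hist(diff, bins=50, density=True)
plt.xlabel("Efficacy difference (regimen 1 - regimen 2)")
plt.show()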
Overall, this small exercise gave me a much more solid understanding of Bayesian inference and MCMC sampling, and of their importance as tools in data analysis. An extension to this implementation would be to compare other available vaccines and study their differences.