August 2011



Firstly, something about the puppies on the cover pages. The happy puppies are named Prior, Likelihood, and Posterior. Notice that the Posterior puppy has half-up ears, a compromise between the perky ears of the Prior puppy and the floppy ears of the Likelihood puppy. The puppy on the back cover is named Evidence. MCMC methods make it unnecessary to explicitly compute the evidence, so that puppy gets sleepy with nothing much to do. These puppy images basically summarize the essence of Bayesian and MCMC methods.

I was tempted to go over the book after reading the preface, where the author writes candidly about what to expect from it. He uses poetic verses and humor to introduce the topic, and promises to use simple examples to explain the techniques of Bayesian analysis. The book is organized into three parts. The first part contains basic ideas on probability, models, Bayesian reasoning and R programming, so I guess it can be speed-read. The second part is the juicy part of the book, where simple dichotomous data is used to explain hierarchical models, MCMC, sampling methods, the Gibbs sampler and BUGS. The third part covers various applications of Bayesian methods to real-life data; it is more like a potpourri of applications, and a reader can pick and choose whichever application appears exciting or relevant to one's work. The highlight of this book is that it enables you to DO Bayesian analysis, and not just KNOW about Bayes. The book also gives a Bayesian analogue for each of the tests one learns in NHST (Null Hypothesis Significance Testing). Historically, Bayesians have scorned tests based on p-values; the inclusion of these tests in the book shows the author's willingness to discuss the issues threadbare and highlight the flaws in using frequentist tests on real-life data.

Something not related to this post: I wonder how many statistics courses in Indian universities teach Bayes to students. Historically, Bayes became popular in B-schools first, and only after it was absorbed there was it adopted in other disciplines. Considering this sequence of events, I really doubt whether B-schools in India ever think of highlighting at least a few case studies based on Bayes. As far as my experience goes, B-schools scare students with boring statistical techniques that are outdated. Andrew Gelman wrote a fantastic book titled "A Bag of Tricks"; I will blog about it someday. It has content that can be used to structure any statistics-based course at the BSc/MBA level. But how many faculty even bother to do such things? What matters is having a few faculty members with the courage to change the curriculum and teach what is relevant in statistics, rather than merely following the traditional frequentist ideas. If you take a random sample of students from any B-school in India and ask them whether they remember anything from their statistics courses, I bet they would say, "Yeah, it was something about p-values, null hypotheses and some kind of hypothesis testing." Sadly, this reflects the state of our education system, which has failed miserably to inculcate statistical thinking amongst students. Anyway, enough of my gripe; let me get back to the intent of the post.

As I mentioned earlier, Part I of the book can be quickly read and worked through. One thing is worth mentioning about this part: the way Bayes' rule is introduced is very intuitive, and by far the easiest to visualize out of all the introductions I have read to date. One comes across Bayes' rule in some form or other, be it through the Monty Hall problem or conditional probability puzzles, but this is the first time I have seen it introduced from a contingency-table perspective. If you look at a contingency table and assume that rows represent data realizations and columns represent parameter realizations, then Bayes' theorem follows quite easily: conditioning on the data is equivalent to restricting attention to one row, and the Bayes probabilities are nothing but the probabilities in that row re-evaluated according to their relative proportions. One new term I learnt from this chapter is "evidence" and the way it is used in relation to Bayes: the denominator in Bayes' rule is termed the evidence. The evidence is the probability of the data according to the model, determined by summing across all possible parameter values weighted by the strength of belief in those parameter values. In other books, the evidence is called the "prior predictive". Another important aspect mentioned is data order invariance: the posterior is the same regardless of the order in which the data arrive, which holds when the data are assumed independent and identically distributed given the parameter, so the likelihood has no dependence on time ordering.
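To make the contingency-table view concrete, here is a minimal R sketch (my own toy numbers, not the book's): columns are candidate parameter values, rows are data outcomes, the joint table is prior times likelihood, and conditioning on the observed row plus renormalizing gives the posterior.

    theta <- c(0.25, 0.50, 0.75)               # candidate parameter values (columns)
    prior <- c(0.25, 0.50, 0.25)               # prior belief over the columns
    lik_heads <- theta                         # P(heads | theta) for each column
    lik_tails <- 1 - theta                     # P(tails | theta)

    joint <- rbind(heads = prior * lik_heads,  # joint table P(data, theta)
                   tails = prior * lik_tails)
    colnames(joint) <- paste0("theta=", theta)

    evidence  <- sum(joint["heads", ])         # the "evidence": sum across the observed row
    posterior <- joint["heads", ] / evidence   # re-evaluate the row by its relative proportions
    posterior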

Part II of the book is its core, where all the fundamentals are applied to inferring a binomial proportion. Most of the algorithms are applied to the simple case of dichotomous data so that the reader gets an intuitive understanding as well as a working knowledge of them.

Inferring Binomial Proportion Via Exact Mathematical Analysis

By taking a beta prior for a binomial likelihood, one gets a beta posterior. Based on one's beliefs, one can choose the values of a and b to reflect the assumed bias in the coin: a U-shaped prior (a belief that the coin is strongly biased one way or the other) can be represented with a and b both less than 1; a uniform prior with a = b = 1; a prior skewed to one side with, say, a = 100, b = 1 or a = 1, b = 100; and a strong belief in a fair coin with a = b set to some large value. Specifying a beta prior is thus a way to translate beliefs into a quantitative measure.
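As a small illustration of the conjugate update (my own numbers, not the book's code): with a Beta(a, b) prior and z heads in N flips, the posterior is Beta(a + z, b + N - z).

    a <- 4; b <- 4                              # prior: mild belief in a fair coin
    z <- 8; N <- 23                             # data: 8 heads in 23 flips
    post_a <- a + z
    post_b <- b + N - z

    theta <- seq(0, 1, length.out = 501)
    plot(theta, dbeta(theta, post_a, post_b), type = "l", ylab = "density")
    lines(theta, dbeta(theta, a, b), lty = 2)   # dashed: the prior

    qbeta(c(0.025, 0.975), post_a, post_b)      # a 95% equal-tailed credible interval
                                                # (the HDI takes a little more work)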

The three goals of inference are 1) estimating the binomial proportion parameter, 2) predicting the data, and 3) model comparison. For the first goal, a credible interval is computed from the posterior distribution. Predicting the data can be done by taking expectations over the posterior distribution. The third goal is trickier; only a few guidelines are given, one being the computation of the Bayes factor.

If there are two competing models, meaning two competing priors for the data, one can use Bayesian inference to evaluate them. One needs to calculate the evidence for both models and form the Bayes factor P(D|M1) / P(D|M2). If this ratio is very large the data favour model 1; if it is very close to 0 they favour model 2. Even then one cannot conclude that the winning model is good in any absolute sense: the model comparison process has merely told us the models' relative believabilities, not their absolute believabilities. The winning model might merely be a less bad model than the horrible losing model.
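A hedged sketch of that computation in R: for a Beta(a, b) prior and z heads in N flips the evidence has a closed form, P(D|M) = choose(N, z) * B(z + a, N - z + b) / B(a, b), so the Bayes factor of two competing priors is just a ratio of two such terms (the particular priors below are my own illustrative choices).

    evidence_beta <- function(z, N, a, b) {
      choose(N, z) * beta(z + a, N - z + b) / beta(a, b)
    }
    z <- 8; N <- 23
    bf <- evidence_beta(z, N, a = 10, b = 10) /   # M1: coin believed to be roughly fair
          evidence_beta(z, N, a = 1,  b = 10)     # M2: coin believed to be tail-biased
    bf    # >> 1 favours M1, << 1 favours M2 -- but only relative to each other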

Steps to check the winning model's robustness (posterior predictive check)

  1. Simulate one value for the parameter from the posterior distribution
  2. Use that parameter to simulate data values
  3. Calculate the proportion of wins
  4. Repeat these steps many (say, a million) times and check whether the simulated data look like the actually observed data

If the simulated data do not resemble the actual data, even the winning model is suspect; as noted above, the best of the competing models might merely be a less bad model than the horrible losing model. Basically, this chapter gives the first flavor of doing Bayes with a known closed-form conjugate prior for computing the posterior.
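A minimal sketch of such a posterior predictive check, assuming a Beta posterior for a single coin (my own example, not the book's code):

    z <- 8; N <- 23                                # observed data
    post_a <- 1 + z; post_b <- 1 + N - z           # posterior under a Beta(1,1) prior

    n_sims    <- 100000
    theta_sim <- rbeta(n_sims, post_a, post_b)               # step 1: theta from the posterior
    z_sim     <- rbinom(n_sims, size = N, prob = theta_sim)  # step 2: simulate data sets
    prop_sim  <- z_sim / N                                   # step 3: proportion of "wins"

    hist(prop_sim, breaks = 30)                    # step 4: does the observed proportion
    abline(v = z / N, lwd = 2)                     # sit comfortably inside this cloud?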


Inferring Binomial Proportion Via Grid Approximation:

This chapter talks about a method that frees the modeler from choosing only closed-form priors. The prior need not be a closed-form distribution such as the beta; it might as well be a tri-modal or otherwise arbitrary discrete mass function, which is what a real-life prior for a binomial proportion would often look like. One can use a triangular, quadratic, parabolic, or whatever discrete prior one can think of. The beauty of the grid method is that you evaluate the prior probability at each grid value, multiply it by the likelihood at that point, and (after normalizing) obtain the posterior probability at the same point on the grid. The grid method has another advantage: the HDI (Highest Density Interval) can be calculated as a discrete approximation. Estimation, prediction and model evaluation are all applied to dichotomous data to show the grid-based method in action. The grid method is thus a convenient way to incorporate subjective beliefs into the prior and compute the HDI for the binomial proportion.
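A small grid-approximation sketch (the ripple-shaped prior and the data are my own assumptions, not the book's):

    theta <- seq(0.001, 0.999, length.out = 1000)     # the grid
    prior <- (cos(4 * pi * theta) + 1)^2              # an arbitrary "ripple" prior...
    prior <- prior / sum(prior)                       # ...normalized over the grid

    z <- 8; N <- 23
    likelihood <- theta^z * (1 - theta)^(N - z)       # Bernoulli likelihood at each grid point
    posterior  <- prior * likelihood
    posterior  <- posterior / sum(posterior)          # the normalizer is the (discrete) evidence

    # A crude HDI: keep the most probable grid points until 95% of the mass is covered.
    ord    <- order(posterior, decreasing = TRUE)
    in_hdi <- ord[cumsum(posterior[ord]) <= 0.95]
    range(theta[in_hdi])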


Inferring Binomial Proportion Via Metropolis Algorithm:

Nicholas Metropolis published the algorithm in 1953

The methods introduced in the previous chapters involve specifying priors that are in closed form, or priors that can be specified in a discrete form on a grid. However, as the number of parameters grows, the number of grid points one needs to evaluate grows exponentially. If we set up a grid on each parameter that has 1,000 values, then a six-dimensional parameter space has 1,000^6 = 1,000,000,000,000,000,000 combinations of parameter values, which is too many for any computer to evaluate.

One can use the Metropolis algorithm, or invoke the necessary functions in BUGS, to sample values from the posterior distribution. The condition that P(theta) and P(D|theta) can be computed at every theta is enough to generate sample values from the posterior distribution. The Metropolis algorithm is introduced via the "politician visit" example, a simple but powerful illustration of the internals of the algorithm. Even though Metropolis is almost never needed just to infer a binomial proportion, the example builds solid intuition into how it works.

The main components of the Metropolis algorithm applied to the simple example are

  • Proposal distribution
  • Target distribution ( Typically the product of likelihood and prior )
  • Burn-in ( what % of the initial random walk should be discarded? )

One can also form a transition matrix, which gives intuition on why the random walk converges to the target distribution and why repeated multiplication of the transition matrix converges to a stable matrix. One thing to keep in mind about the Metropolis algorithm is that the prior and likelihood need not be normalized probability distributions; all that matters is the ratio of target probabilities at the proposed and current positions. The book says it better: "The Metropolis algorithm only needs the relative posterior probabilities in the target distribution, not the absolute posterior probabilities, so we could use an unnormalized prior and/or unnormalized posterior when generating sample values of θ."

One must, however, keep in mind that proposal distributions can take many different forms; the goal should be to use a proposal distribution that efficiently explores the regions of the parameter space where the probability has most of its mass. MCMC methods have revolutionized Bayesian analysis, and the Metropolis algorithm is one example. Computation of the evidence is the toughest part of Bayesian analysis, and MCMC methods sidestep it by needing only relative posterior probabilities. The "politician example" mentioned in the book is thus a good introduction to Markov chain Monte Carlo methods.
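Here is a bare-bones Metropolis sampler for the binomial proportion (my sketch, not the book's politician example): a Beta(1,1) prior, z heads in N flips, and a normal proposal whose standard deviation is an assumed value that would need tuning in practice.

    z <- 8; N <- 23
    target <- function(theta) {                     # unnormalized posterior: prior x likelihood
      if (theta < 0 || theta > 1) return(0)
      dbeta(theta, 1, 1) * theta^z * (1 - theta)^(N - z)
    }

    n_iter <- 50000; proposal_sd <- 0.1
    chain <- numeric(n_iter)
    chain[1] <- 0.5                                 # arbitrary starting value
    for (t in 2:n_iter) {
      proposal    <- rnorm(1, mean = chain[t - 1], sd = proposal_sd)
      accept_prob <- min(1, target(proposal) / target(chain[t - 1]))  # only the RATIO matters
      chain[t]    <- if (runif(1) < accept_prob) proposal else chain[t - 1]
    }
    burn_in <- 1000
    samples <- chain[-(1:burn_in)]                  # drop the burn-in portion
    hist(samples, breaks = 50)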

Basic Intro to BRugs

For the Metropolis algorithm to work well, the following become crucial

  • The proposal distribution needs to be tuned to the target (posterior) distribution
  • Initial samples during the burn-in period need to be excluded
  • The sampling chain needs to be run long enough

These issues are taken care of by the BUGS package. To use BUGS, one needs a wrapper so that R can communicate with it. There are different BUGS versions for Windows, Linux and other open-source platforms; OpenBUGS is used in this book for specifying Bayesian models and generating MCMC samples. The wrapper used to connect R with OpenBUGS is the BRugs package. I am using a 64-bit machine, and it was particularly painful to learn that BRugs is available only for 32-bit R. So the code I wrote had to be configured to talk to a 32-bit installation of R so that BRugs could be invoked. BTW, it requires a 32-bit JRE if you are working with Eclipse.

The syntax for OpenBUGS is somewhat similar to R, so the learning curve is not steep. In fact, all one needs to know, at least for preliminary applications, is how to specify the likelihood, the prior and the data. The chapter does a fantastic job of showing how to drive OpenBUGS from R: modelData(), modelCompile(), modelGenInits() and modelUpdate() are the main BRugs functions to load data into the model, compile it, initialize the chain and run the MCMC chain. All of these details are already coded in BUGS and can be leveraged through the BRugs wrapper. BUGS can also be used for prediction and model comparison.
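A rough sketch of the BRugs workflow, pieced together around the function names the chapter uses (modelData, modelCompile, modelGenInits, modelUpdate); treat the exact call signatures and the samples* helpers as assumptions to be checked against the package documentation.

    library(BRugs)

    modelString <- "
    model {
      for (i in 1:N) {
        y[i] ~ dbern(theta)      # Bernoulli likelihood
      }
      theta ~ dbeta(1, 1)        # uniform prior on the proportion
    }
    "
    writeLines(modelString, con = "model.txt")

    dataList <- list(y = c(rep(1, 8), rep(0, 15)), N = 23)   # 8 heads in 23 flips

    modelCheck("model.txt")                 # parse the model
    modelData(bugsData(dataList))           # load the data
    modelCompile(numChains = 1)             # compile
    modelGenInits()                         # generate initial values
    modelUpdate(1000)                       # burn-in
    samplesSet("theta")                     # start recording theta
    modelUpdate(10000)                      # the actual MCMC run
    thetaSample <- samplesSample("theta")   # pull the chain back into R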

The highlight of this chapter is the application of Metropolis to single-parameter estimation. Typically Metropolis is used for multi-parameter estimation and hence is usually introduced only after significant material has been covered. By applying it to a simple example where the proposal distribution takes only two values, a lot of things become intuitively clear to the reader.

Also, the extensive R code provided for the Metropolis algorithm, computing the HDI, and invoking BUGS through BRugs is extremely useful for getting a feel of the workings. The code is well documented, and the fact that it is part of the chapter itself makes the chapter self-contained.


Inferring Binomial Proportion Via Gibbs Sampling Algorithm:

Stuart Geman and Donald Geman became enthused about applying Bayes to pattern recognition after attending a seminar, and went on to invent the technique called Gibbs sampling.

Stuart Geman and Donald Geman

In any real-life modeling problem one needs to deal with multiple parameters, and one needs a method for doing Bayesian analysis on them. BUGS (Bayesian inference Using Gibbs Sampling) provides powerful software to carry out such analyses.

What is Gibbs sampling? Consider the parameters to be estimated, say theta_1, theta_2, ..., theta_n. Select a theta_i from the list and generate a random value for it from the conditional distribution of theta_i given all the other thetas and the data. This procedure is repeated in a cyclical manner over all the thetas, and the resulting chain converges to the target distribution. Unlike the Metropolis algorithm, where a proposed jump is sometimes rejected, in Gibbs sampling the proposed value is never rejected. Gibbs sampling is an extremely efficient algorithm when it can be done; the sad part is that it cannot be done in all cases.

The thing about multiple parameters is that to look at the probability distribution you need a 3D plot, or a projection onto a 2D surface. So the chapter starts off with some fundamentals of such plots using the contour and persp functions in R; one also needs the "outer" function to set them up. The chapter then derives the posterior for a two-parameter case using formal analysis, displaying the prior, likelihood and posterior with persp and contour plots. The grid method is used when the prior cannot be summarized by a nifty function: say you have a prior that looks like ripples and you want the posterior. One can easily form a grid, evaluate the prior at each point, evaluate the likelihood at each point, and multiply the two to get the posterior distribution.
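A small sketch of those two-parameter grid plots (my own toy data, not the book's): build the joint posterior on a theta1-theta2 grid with outer(), then view it with persp() and contour().

    theta1 <- seq(0.01, 0.99, length.out = 100)
    theta2 <- seq(0.01, 0.99, length.out = 100)

    z1 <- 5; N1 <- 7; z2 <- 2; N2 <- 7              # hypothetical data for two coins
    posterior <- outer(theta1, theta2, function(t1, t2) {
      prior <- dbeta(t1, 2, 2) * dbeta(t2, 2, 2)    # independent Beta(2,2) priors
      lik   <- t1^z1 * (1 - t1)^(N1 - z1) * t2^z2 * (1 - t2)^(N2 - z2)
      prior * lik
    })
    posterior <- posterior / sum(posterior)         # normalize over the grid

    persp(theta1, theta2, posterior, theta = 30, phi = 30)   # perspective plot
    contour(theta1, theta2, posterior)                       # contour plot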

So, if grid methods or closed-form priors do not work (as in the multi-parameter case), one can use the Metropolis algorithm, where you choose a prior and a likelihood function, select a burn-in period, and use BUGS or R code to generate the posterior distribution. If there are already three methods, what is the advantage of Gibbs sampling? Answer: one can skip the effort that goes into tuning the proposal distribution.

The procedure for Gibbs sampling is a type of random walk through parameter space, like the Metropolis algorithm. The walk starts at some arbitrary point, and at each point in the walk, the next step depends only on the current position, and on no previous positions. Therefore, Gibbs sampling is another example of a Markov chain Monte Carlo process. What is different about Gibbs sampling, relative to the Metropolis algorithm, is how each step is taken. At each point in the walk, one of the component parameters is selected. The component parameter can be selected at random, but typically the parameters are cycled through, in order: θ1, θ2, θ3, . . . , θ1, θ2, θ3, . . .. (The reason that parameters are cycled rather than selected randomly is that for complex models with many dozens or hundreds of parameters, it would take too many steps to visit every parameter by random chance alone, even though they would be visited about equally often in the long run.) Gibbs sampling can be thought of as just a special case of the Metropolis algorithm, in which the proposal distribution depends on the location in parameter space and the component parameter selected. At any point, a component parameter is selected, and then the proposal distribution for that parameter’s next value is the conditional posterior probability of that parameter. Because the proposal distribution exactly mirrors the posterior probability for that parameter, the proposed move is always accepted.
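A Gibbs-sampling sketch for two coin biases, mirroring the chapter's setup (the data are my own made-up numbers): with independent beta priors, each conditional posterior is itself a beta distribution, so every draw comes straight from the conditional and nothing is ever rejected.

    z1 <- 5; N1 <- 7; z2 <- 2; N2 <- 7            # hypothetical data for the two coins
    a <- 2; b <- 2                                # Beta(2,2) prior on each theta

    n_iter <- 20000
    theta1 <- theta2 <- numeric(n_iter)
    theta1[1] <- theta2[1] <- 0.5
    for (t in 2:n_iter) {
      # cycle through the components, drawing each from its conditional posterior;
      # for independent coins the conditional does not actually depend on the other theta
      theta1[t] <- rbeta(1, a + z1, b + N1 - z1)
      theta2[t] <- rbeta(1, a + z2, b + N2 - z2)
    }
    plot(theta1, theta2, pch = ".")               # samples from the joint posterior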

The place where Gibbs sampling is extremely useful is when it is difficult to sample from the joint distribution but easy to sample from the conditional distribution of each parameter given the others. A big advantage of Gibbs sampling is that there are no rejected simulations, unlike the Metropolis algorithm; thus the basic advantage of Gibbs sampling over Metropolis is that there is no need to tune a proposal distribution and no inefficiency from rejected proposals. But there is one disadvantage: because it changes only one parameter value at a time, its progress can be stalled by highly correlated parameters.

A visual illustrates what goes wrong with Metropolis in case one doesn't tune the proposal distribution properly. If you choose a proposal distribution with too low a standard deviation, the path stagnates around a local region (figure on the left). If you choose a proposal distribution with too high a standard deviation, most of the proposed jumps are rejected (figure in the center). The key is to use a well-tuned proposal distribution so that the chain wanders enough to collect representative values from the posterior distribution (figure on the right).

[Three panels: proposal SD too small (left), proposal SD too large (center), well-tuned proposal (right)]

The above visual makes a strong case for trusting the MCMC machinery in BUGS instead of a hand-tuned Metropolis algorithm when the modeler is not sure about the proposal distribution.


Bernoulli Likelihood with Hierarchical Prior

Dennis V. Lindley and Adrian F. Smith

Lindley and his student Adrian F. Smith were the first to show Bayesians the way to develop hierarchical models. Initially the hierarchical models fell flat, as they were too specialized and stylized for many scientific applications; it would be another 20 years before students routinely began their study of Bayes by looking into hierarchical models. The main reason hierarchical models became workable was the availability of computing power.

This chapter is all about hierarchical models and applying the various techniques to parameter inference in them. It starts off with an example where the binomial proportion parameter of a coin depends on another parameter, the hyperparameter. The hyperparameter mu is given a beta distribution, and the dependency between the hyperparameter and the parameter is governed by a constant K: theta is drawn from a beta distribution centered on mu whose narrowness is set by K. For large values of K there is a strong dependency, as the resulting beta distribution is very narrow, whereas for small values of K the dependency is weak, as the resulting beta is flat. Note that there is no need for hierarchical modeling if K takes a very high value or mu takes only one value; in both those situations the uncertainty about theta is captured at a single level. The first approach explored is the grid approach: divide the parameter space into a fine grid, formulate the priors and the joint prior, multiply by the joint likelihood, and you obtain the joint posterior.

This is further extended to a series of coins, and this is where one sees the plain-vanilla grid approach fail spectacularly. For a two-coin experiment it is still manageable, but for a three-coin model the number of grid points explodes to about 6.5 million (on a 50-point grid for each parameter), and to about 300 million for a four-coin model. So this approach, even though appealing and simple, becomes inefficient as soon as the parameter space widens a bit. MCMC comes to the rescue again, and BUGS is used to estimate the parameters for more than two coins. The approach followed is: write the BUGS model, wrap it in a string, use the BRugs package to interact with BUGS, simulate the MCMC chain, get the results back from BUGS and analyze them. BUGS gives enormous flexibility; one can even include K as a parameter instead of fixing it. The chapter then introduces modeling a bigger parameter space and ends with a case study where hierarchical models are applied to a real-life situation.
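A hedged sketch of the hierarchical model in the BUGS language (my reconstruction of the structure described above, not the book's exact code): each coin's theta comes from a beta distribution governed by the hyperparameter mu and the concentration constant K.

    modelString <- "
    model {
      for (j in 1:nCoins) {
        z[j] ~ dbin(theta[j], N[j])   # heads out of N flips for coin j
        theta[j] ~ dbeta(a, b)        # each coin's bias depends on mu and K
      }
      a <- mu * K
      b <- (1 - mu) * K
      mu ~ dbeta(2, 2)                # hyperprior on mu
      K <- 20                         # fixed here; could itself be given a prior
    }
    "
    writeLines(modelString, con = "model.txt")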

The problem with the examples is that BRugs works with OpenBUGS, which is available only for 32-bit machines. My machine is 64-bit and I had to do some acrobatics to make the examples work. I used RStudio, which has a 32-bit R version and a 32-bit JRE, and tried running the whole stuff. Even though I could run the code, I struggled to follow the material because I was wrestling with the setup. That's when I stumbled on R2WinBUGS, which is a wrapper for WinBUGS. I am relieved that I can now run R code that talks to WinBUGS without having to worry about the 32-bit/64-bit constraint. Once I figured out how to code almost everything in R and use R2WinBUGS to run the MCMC sampling, working on hierarchical models was a charm. While experimenting with the various software and packages, here are a few things I learnt:

  • BRugs works with OpenBUGS. The functions in BRugs replicate the interface functions of the WinBUGS software. BRugs works on 32-bit machines but not on 64-bit machines
  • OpenBUGS is published under the GPL. It runs on Linux too, and provides a flexible API so that BRugs can talk to it
  • The "coda" package is used to analyze WinBUGS output
  • JAGS (Just Another Gibbs Sampler) is an alternative to WinBUGS/OpenBUGS and is used together with coda and R
  • R2WinBUGS is an interface co-written by Gelman to interact with WinBUGS. Its main functions are bugs(), bugs.data.inits(), bugs.script(), bugs.run() and bugs.sims() (a hedged sketch of a bugs() call follows below)
  • It's better to code the model in WinBUGS, check the basic syntax of the model, and then code the whole thing in R

Thinning is another aspect discussed in the chapter. Since successive posterior samples are dependent, it is important to reduce the autocorrelation amongst them. This is done by specifying a thinning interval so that only every k-th sample is retained.
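Here is a hedged sketch of driving WinBUGS through R2WinBUGS for the hierarchical model above; the argument names are the ones I recall from the bugs() help page, so double-check them against your installed version. The n.thin argument implements the thinning just described, and the data and initial values are hypothetical.

    library(R2WinBUGS)

    dataList <- list(z = c(5, 2, 6), N = c(7, 7, 7), nCoins = 3)   # hypothetical data
    inits <- function() list(theta = rep(0.5, 3), mu = 0.5)

    fit <- bugs(data = dataList,
                inits = inits,
                parameters.to.save = c("theta", "mu"),
                model.file = "model.txt",      # the BUGS model written earlier
                n.chains = 3,
                n.iter = 12000,
                n.burnin = 2000,
                n.thin = 5)                    # keep every 5th sample to cut autocorrelation
    print(fit)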

Hierarchical Modeling and Model Comparison

This chapter talks about model comparison: if you specify two or more competing priors, on what basis should one choose among the resulting models?

For comparing two models, an elementary approach is used as an introduction. Say one is not certain about the binomial proportion parameter and entertains a prior with two peaks, at 0.25 and 0.75, effectively two competing models. One can code the model in WinBUGS in such a way that posterior samples are generated from these two priors at once, so a single simulation gives the posterior probability of each model. These posterior probabilities can then be used to judge the competing models. The chapter then goes on to use pseudo-priors and shows a real-life example where Bayes can be used to evaluate competing models/priors.
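A hedged sketch of that model-index trick in BUGS (structure only; the chapter's actual code also uses pseudo-priors to keep the chain efficient, which I omit here):

    modelString <- "
    model {
      for (i in 1:N) {
        y[i] ~ dbern(theta)
      }
      theta <- equals(mIdx, 1) * theta1 + equals(mIdx, 2) * theta2
      theta1 ~ dbeta(3, 9)          # prior under model 1 (mean 0.25)
      theta2 ~ dbeta(9, 3)          # prior under model 2 (mean 0.75)
      mIdx ~ dcat(mPrior[])         # model index: 1 or 2
      mPrior[1] <- 0.5
      mPrior[2] <- 0.5
    }
    "
    # The proportion of MCMC steps with mIdx == 1 approximates the posterior probability of model 1.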

 

Null Hypothesis Significance Testing

Ronald Aylmer Fisher

It was Fisher who introduced p values, which became the basis for hypothesis testing for many years. Now how do we tackle this issue in this new world where there is no paucity of data, amazing visualization tools and computational power. What has Bayes’ got to offer in this area ?

The chapter is extremely interesting for any reader who wants to compare frequentist methods with Bayesian approaches. It starts by ripping apart the basic problem with NHST. From a sample of 8 heads out of 23 flips, the author uses NHST to check whether the coin is biased. In one case the experimenter had a fixed N in mind, in which case the null hypothesis cannot be rejected; in the second case the experimenter was tossing the coin until 8 heads appeared, in which case NHST leads to rejecting the null hypothesis. Using this simple example the author makes the point that the inference depends on the intention of the experimenter, which the realized data can never reveal. This dependence of the analysis on the experimenter's intentions conflicts with the opposite assumption that the experimenter's intentions have no effect on the observed data. So NHST looks dicey for application purposes. Instead of tossing coins, if we toss flat-headed nails, calling one way of landing "tails" and the other "heads", then NHST gives the same result, even though our prior about flat-headed nails is that the parameter is strongly biased towards tails. NHST gives no way of incorporating prior beliefs into the modeling process, so in a lot of cases it becomes a tool that gives nonsensical answers.
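A quick way to see the dependence on the stopping intention, using my own one-tailed calculation with null theta = 0.5 (the point is simply that the same 8 heads in 23 flips yields different p-values under the two intentions):

    z <- 8; N <- 23; theta0 <- 0.5

    # Intention 1: N was fixed in advance, so z ~ Binomial(N, theta0)
    p_fixed_N <- pbinom(z, size = N, prob = theta0)     # P(z or fewer heads)

    # Intention 2: flipping continued until z heads, so the number of tails ~ NegBinomial(z, theta0)
    p_fixed_z <- pnbinom(N - z - 1, size = z, prob = theta0, lower.tail = FALSE)  # P(N or more flips)

    c(fixed_N = p_fixed_N, fixed_z = p_fixed_z)   # same data, different sampling distributions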

The same kind of reasoning holds for confidence intervals: based on the covert intention of the experimenter, the confidence interval can be entirely different for the same data. The Bayesian highest density interval, on the other hand, is straightforward to interpret: it directly expresses the believabilities of theta, has no dependence on the intention of the experimenter, and is responsive to the analyst's prior beliefs. The chapter ends by giving a few advantages of thinking about sampling distributions.


Bayesian Approaches to Testing a Point ("Null") Hypothesis:

The way to test a null hypothesis is to check whether the null value falls within the credible interval. One important point mentioned in this context is that one must not be misled by visual similarity between the marginal distributions of two parameters into inferring that they are equal: the parameters might be positively correlated in their joint distribution, and the credible interval for their difference might not contain 0. I also came across a term that was new to me: Region of Practical Equivalence (ROPE). Instead of checking whether the null value lies in the credible interval, it is better to work with an interval around the null value, the ROPE, and ask how it relates to the credible interval; by talking about a ROPE instead of a single point, we cut down the false alarm rate. To summarize, a parameter value is declared to be accepted for practical purposes if that value's ROPE completely contains the 95% HDI of the posterior of that parameter.
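A minimal sketch of the ROPE decision rule applied to posterior samples (my own helper, not the book's HDIofMCMC function; the ROPE width is a judgment call):

    hdi_from_samples <- function(samples, mass = 0.95) {
      sorted <- sort(samples)
      n <- length(sorted)
      k <- ceiling(mass * n)                     # number of samples inside the interval
      widths <- sorted[k:n] - sorted[1:(n - k + 1)]
      i <- which.min(widths)                     # narrowest interval holding `mass`
      c(lower = sorted[i], upper = sorted[i + k - 1])
    }

    set.seed(1)
    post <- rbeta(20000, 1 + 8, 1 + 15)          # e.g. posterior for 8 heads / 23 flips, Beta(1,1) prior
    hdi  <- hdi_from_samples(post)
    rope <- c(0.45, 0.55)                        # ROPE around the null value 0.5

    if (hdi["upper"] < rope[1] || hdi["lower"] > rope[2]) {
      cat("HDI falls entirely outside the ROPE: reject the null value\n")
    } else if (hdi["lower"] >= rope[1] && hdi["upper"] <= rope[2]) {
      cat("ROPE contains the whole HDI: accept the null value for practical purposes\n")
    } else {
      cat("HDI and ROPE overlap: withhold judgment\n")
    }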


Part III of the book talks about model building, more specifically the Generalized Linear Model (GLM).

The Generalized Linear Model

The entire structure of modeling that is usually found in statistics 101 courses is based on the Generalized Linear Model. In such a structure you typically find predictor and predicted variables linked by a function, and depending on the type of predictors and the function, the GLM goes by a different name. If the predictors and the predicted variable are metric-scale variables, the GLM usually goes by the name "regression" or "multiple regression". If a predictor is nominal, it is recoded into appropriate factors and then fed into the usual regression equation. There are cases where the predicted variable is connected to the predictors through a link function, typically a sigmoid (logit) or a cumulative normal; the sigmoid is used when the predicted variable is dichotomous, so its mean lies between 0 and 1. The various illustrations showing the effect of the gain (steepness) and threshold of a sigmoid function are very helpful for getting an idea of the various GLM models possible.

The following is the usual form of GLM

mu = f(b0 + b1*x1 + b2*x2 + ...),    y ~ pdf(mu, precision)

The mean of the predicted variable is a function f of the predictors (x1, x2, ...), where f is termed the link function, and the predicted variable is in turn modeled as a pdf with that mean and a precision parameter. The various forms of GLM are offshoots of this structure: typically f is the identity, a sigmoid (logistic) or a cumulative normal function, and the pdf is normal, Bernoulli, etc., depending on the predicted variable. The chapters in Part III are organized by the form f takes and the types of the predictors and the predicted variable. Chapter 15 is the Bayes analogue of the t-test, Chapter 16 of simple linear regression, Chapter 17 of multiple regression, Chapter 18 of one-way ANOVA, Chapter 19 of two-way ANOVA, and Chapter 20 deals with logistic regression.
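A tiny sketch of this template with a sigmoid (logistic) link and a Bernoulli outcome, i.e. the structure behind the logistic-regression chapter (my own simulated data; the book fits the Bayesian version in BUGS):

    set.seed(42)
    n  <- 200
    x  <- rnorm(n)                                   # one metric predictor
    b0 <- -0.5; b1 <- 2                              # threshold and gain of the sigmoid
    mu <- 1 / (1 + exp(-(b0 + b1 * x)))              # f = sigmoid of the linear combination
    y  <- rbinom(n, size = 1, prob = mu)             # predicted variable ~ Bernoulli(mu)

    glm(y ~ x, family = binomial(link = "logit"))    # frequentist fit, just to check the structure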

I will summarize the last part of the book (~200 odd pages) in another post at a later date. The first two parts of the book basically provide a thorough working knowledge of the various concepts in Bayesian Modeling. In fact a pause after marathon chapters of part I and part II of the book might be a better idea as one might need some time to reflect and link the concepts to whatever stats ideas that one already has.

Takeaway

The strength of this book is its simplicity. The book is driven by ONE estimation problem, i.e, inferring a binomial proportion. Using this as a “spine”, the author shows a vast array of concepts / techniques / algorithms relevant to Bayes’. In doing so, the author equips the reader with a working knowledge of Bayesian models. The following quatrain from Prof. McClelland (Stanford) on the cover of the book tells it all :

John Kruschke has written a book on Statistics
Its better than others for reasons stylistic
It also is better because it is Bayesian
To find out why, buy it – its truly amazin!

[Book cover: Moneyball by Michael Lewis]

This book by Michael Lewis delves into the reasons behind the mysterious success of the Oakland Athletics, one of the poorest teams in the US baseball league. In a game where players are bought at unbelievable prices, and where winning or losing is a matter of who has the bigger financial muscle, the Oakland A's go on to make baseball history with rejected players and rookies.

"Is their winning streak a result of random luck, or is there a secret behind it?" is the question that Michael Lewis tries to answer. Like a sleuth, the author investigates the system and the person behind it, Oakland A's general manager Billy Beane.

The author traces the life of Billy Beane from his college days to the time he becomes general manager of the Oakland A's. Billy was one of the most promising players of his day, and everyone believed he was a superstar in the making. To everyone's surprise, however, Billy doesn't make it as a player. He quits and takes up a desk job in the Oakland A's organization, in the part responsible for picking talent and managing the team. Scouts, as they are called, are the people who draft young and promising players into the team. Billy in the course of time realizes that the methods followed by scouts are subjective, gut-feel based, touchy-feely criteria; even some of the broad metrics used to rate players appear vague to him. He sets out on a mission to create metrics that truly reflect the value of players. Billy relies on statistics and metrics to shortlist players, and he does so in a manner similar to a futures and options trader.

Before Black-Scholes showed that an option can be replicated using fractional units of stock and bond, option prices were way above fair value; once traders had a rough picture of fair value, the difference was arbitraged away. In the same way, the players belonging to the financially rich teams are overvalued. Billy systematically chips away at the broad metrics and finds ways to replicate these overvalued players with new, undervalued players and rookies who fit his metric criteria. So, if you think about the book from a finance perspective, Billy Beane is a classic arbitrage trader who shorts overvalued players and goes long undervalued players. In fact, the way Billy operates by selling and buying players is similar to any trading desk operation: he starts the season with really average players, and as the season proceeds he evaluates the various players available to trade, develops aggressive buy/sell strategies (for players), and creates a team that statistically has a higher chance of winning. I found the book very engrossing, as it is a splendid account of Billy Beane's method of diligently creating a process that is system-dependent rather than people-dependent.

This book is being adapted into a movie starring Brad Pitt as Billy Beane, slated to be released this September. If the movie manages to capture even 30-40% of the pace and content of the book, I bet it will be a runaway hit.

[Book covers: How an Economy Grows and Why It Doesn't? / How an Economy Grows and Why It Crashes?]

These books were written by the father-son duo of Irwin and Peter Schiff, albeit at different points in time. The first, "How an Economy Grows and Why It Doesn't?", was written by Irwin A. Schiff in 1979; his son followed it up in 2010 with "How an Economy Grows and Why It Crashes?". Both books give a chance to ponder the situation the US is in, with the financial crisis of 2008 still in full force in 2011. I have summarized the first book here. In this post I will try to summarize the second book, which the author, Peter Schiff, calls "a riff on the original".

Of the 220-odd pages of the book, the first 130-odd pages are mostly a repeat of the first book, in words and a few illustrations; I prefer the graphic-novel approach of the first book to the textual description here. The first book ends with the senator finally coming out and openly declaring that he can do nothing but ask the islanders to go back to fishing. This book takes a different route, obviously looking at what has happened in the last 20 years: the rise of China. Just when the senator is thinking of talking openly to the public, a lifeline appears: Sinopia. (The main island of the story, by the way, is given the obvious name Usonia.) Sinopia was a strange island where all citizens were required to fish but their catch did not belong to them; instead the fish were turned over to the king, who then decided which subjects deserved to get some back. This system did not provide much fish per capita, and there was a huge income (fish) disparity on the island. Seeing the progress of Usonia, the Sinopian king decided that he would somehow possess the Fish Reserve Notes that were supposedly the key to advancement. Thus he knocked on the doors of Usonia to exchange real fish for Fish Reserve Notes. This was a lifeline for Usonia, which had realized that its notes were actually worthless. The exchange brought fish back into the reserves, prices began to cool down, consumption was back, and the island was back in a functioning state, all because of Sinopia. The Sinopian king now used these notes to buy tools from Usonia, and the leftover notes were parked in the Fish Reserve Bank, as Sinopia had no robust banking system. The king made a policy that whoever purchased tools from him could keep all the extra fish they caught. He also did a devious thing: he required his citizens to swap their extra fish for Fish Reserve Notes, so the king basically hedged his risk with his own islanders as the counterparty. Flush with Fish Reserve Notes, the Sinopians started pumping money into Usonia as savings, and Usonia was soon awash with funds and credit, creating a spending-binge atmosphere on the island. With most of the production work happening in Sinopia, Usonia now concentrated on the service sector, and with the Sinopians' willingness to accumulate notes, the trade relationship was a skewed one where one island largely produced and the other largely consumed.

A resident of another island, Bongobia, realized that there was a risk Usonia might not be able to redeem the notes with fish and started hammering the Fish Reserve counter to exchange notes for fish quickly. The Senator-in-Chief had no option but to close the Fish Reserve window. From then on, the value of Fish Reserve Notes on the international market would be determined only by what someone was prepared to trade for them, not by the fact that they could be redeemed for fish; in truth, the notes' value hinged on Usonia's status as a great economic and military power. No disaster happened after closing the fish window, and Usonia experienced unprecedented growth on the back of consumer spending and demand for Fish Reserve Notes.

The book then moves on to creating characters and extending the story to narrate the housing market crisis in the US.

With the service sector in full swing, Usonian bank loan officers cast their eyes on the island's sleepy hut-loan market. To become popular, Senator Cliff Cod devised a plan whereby the government would ensure that everyone could get a hut loan; he created Finni Mae and Fishy Mac to buy hut loans from the market. The hut-lending program was a massive hit amongst banks, as they were earning risk-free profits. Another agency, Sushi Mae, started underwriting loans to youngsters who wished to enroll in surfing schools. These agencies created a big industry where hut building, hut selling and hut decorating took off. BTW, no actual fish was being generated; nothing productive was happening. Loans were now being made not because they were necessarily the best use of savings, but because the senators had a political stake in encouraging hut ownership and education. The senate gave tax breaks on hut loans and thus stimulated the activity to crazy levels. With the influx of Sinopian fish, there was an amazing amount of credit available, and risk was conveniently ignored.

Into this environment enters Manny Fund. Manny starts offering a different type of loan, the "hut fish extraction", in which hut owners refinance existing loans with bigger ones, given the appreciation of hut values since their purchase. Thus there was even more easy credit available to islanders. Huts became more luxurious and hut values reached stratospheric levels; islanders started looking at huts as short-term investments rather than places to stay. It had to happen someday: the hut market took a downturn and every associated industry felt the pain. Unlike an ordinary sectoral downturn, this was credit-based pain; it was like removing oxygen from the air, and everybody started to suffer. The government could only do one thing: urge consumers to spend more. It wanted to somehow spend its way through the crisis. The stimulus was not working, and in need of fish, it desperately took a loan from Sinopia. Its ploy, as usual, was to pay back using Fish Reserve Notes.

Despite the bailouts, Usonia was not getting out of the crisis. In such a scenario, Barry Ocuda becomes the new Senator-in-Chief after promising the islanders "transformation". He started pushing Fish Reserve Notes to the public by giving them assistance explicitly and implicitly (incentives to first-time hut buyers), increasing direct aid to schools, trying to create some construction jobs, and so on. Was this spending the most efficient use of the island's resources? Instead of the market allocating resources, a small group of people started making the decisions. While this massive plan was being put into practice, there was one problem: Usonia was completely out of fish. Instead of taking harsh measures, it took the easier option: borrow more. It reached a situation where most of the debt was funded by other islands.

However, things started to change in Sinopia. In a turn of events, Sinopia stopped buying Fish Reserve Notes, realizing that all the goods and services could just as well be produced and consumed in Sinopia itself; why bother about Usonia when all the resources are in the homeland? With demand for Fish Reserve Notes falling, Usonia was basically stuck, and other islands followed Sinopia in cutting back their demand for the notes. Amidst this, a Sinopian ship landed on Usonian shores with loads of fish. As usual, the Usonian government assumed Sinopia was interested in swapping fish for Reserve Notes. In a devastating move for Usonia, the Sinopians instead chose to use their real fish to buy out all the major corporations, huts and services in Usonia. In one move, Usonia became an impoverished island, and Barry Ocuda had no choice but to openly ask the islanders to start fishing again.

The book's message is very clear: if the US keeps its spending and borrowing at current levels, it will soon be hit by hyperinflation and face economic devastation.

Takeaway

This book brings the China factor into the equation and shows the deadly consequences for the US in the times to come. Hyperinflation, currency default and ultimately economic devastation will become a reality for the US unless it takes harsh measures. "Will it?" is something only time will reveal.

[Book cover: How an Economy Grows and Why It Doesn't?]

This is a graphic novel explaining the growth of a general economy from its barebones structure to a full-fledged economic system. Through an allegory, the author shows the deep malaise in the functioning of the US economy. It starts off with three men on an island, Able, Baker and Charlie, who merely catch fish, eat and sleep. There is no credit, no investment, no savings to begin with. Able gets a brainwave to make a fishing net that will save him some time to pursue other activities. He under-consumes for a day, takes a risk, and comes up with a fishing net, thus managing to catch 2 fish per day instead of 1. This is the first time the island has savings and a piece of capital equipment (the net). It is also the first time one of the islanders can do something other than merely taking care of survival.

With the new savings, Able can choose among several options: 1) hold on to what he has saved, 2) consume it, 3) loan it out, 4) invest it, or 5) some combination of the above. The author quickly argues that the only way Able can increase his wealth is by making it available to other members of his community. Baker and Charlie make use of a loan and build their own nets; with more savings, they build a bigger net (a bigger capital project) and are thus able to generate far more savings for themselves and the island. These savings were possible only because Able did not lend for frivolous activities like vacationing in the first place. Capital loans are preferred to unnecessary consumer loans, as the former increase savings while the latter reduce them. The point the author makes is: "Loans for consumption purposes reduce the amount of funds available to finance both capital projects and the production of more consumer goods, and hence can ONLY lower society's standard of living."

The increased prosperity of the island gives rise to a need for effective storage of fish, and hence a fish savings-and-loan operation is kicked off by MaxGoodBank. Its intention is a noble one: it lends to needy people who pay interest, takes fish deposits, and pays depositors a specific number of fish for entrusting their fish to MaxGoodBank. So the island now has savings, credit and investment well oiled into the society. There is also Manny Fund, which invests in risky projects; islanders who have the appetite for risk invest their fish in Manny Fund and earn returns or lose their fish based on the performance of the projects it invests in. However, the losses of Manny Fund don't threaten the society's credit structure.

All is well until the islanders decide they need a government to take care of law and order and to protect life and property. The problem starts when democracy takes an ugly turn: the elected senators become so powerful that they control the courts and pass whatever laws they deem appropriate. Franklin Dee V becomes a senator and starts making nasty moves to gain and sustain power over periodic elections. He passes a law whereby MaxGoodBank is forced to give low-interest loans to businesses, schools, minorities and other sections of society. This in itself is not bad, but extending credit to people who have no hope of paying it back wreaks havoc on the system. Franklin Dee V then creates another phantom, the Franklin Reserve Notes, with the promise that the government will guarantee fish in return for the printed notes. Basically, the government has forced MaxGoodBank to exchange fish for printed paper. Soon it becomes clear that the coffers are empty and there is no way to give back the promised fish for each Franklin Reserve Note, and the senators are warned that their "Franklin Reserve System" could be a catastrophe for the island.

In this situation, the senators employ fish technicians who work on fish skins and skeletons and turn them into "fish" by padding them with meat from original fish. Basically, they are creating fraudulent fish worth less than the originals. To cover up the fraud, the senators pass a law that only official fish, marketed as officially decontaminated fish, may be used in society. They also pass fraudulent laws that create a situation where islanders deposit the skeletal remains of fish with the Fish Bank. MaxGoodBank sees through this ploy and refuses to cooperate; he is promptly removed and Chesley Bartin is appointed the new director of the Fish Bank. Soon the official fish floating around the island were half the size of the original fish, and naturally there was inflation: the price of everything doubled and consumers were at a loss to explain why. Meanwhile, the government congratulated itself and the islanders, saying that inflation was a sign of prosperity. This went on for quite some time, until the official fish were one third the size of the original fish and prices on the island had skyrocketed to 200% of what they were before the Franklin Reserve Notes.

Islanders saw that they were effectively getting a third of the fish available in the sea and started using offshore fish banks. The senators realized the threat this posed to their well-being and promptly passed a law to regulate fish being deposited offshore. Soon there was an underground economy for original fish, as some islanders refused to give away fish for Franklin Reserve Notes. Subsequently, the shortage forced prices to jump by 600%. People were laid off; there was unemployment everywhere. What did the government do? It chose the easy way out, "unemployment insurance", and printed away to glory. These notes created a run on the Fish Bank, and there was a severe banking crisis on the island. Chesley Bartin, the director, had no clue and ran to the government for a possible solution. The government, out of fraudulent tricks, called in the economists, who offered the "expand credit" and "lower taxes" solutions, which were obviously useless as the island was already crippled with no credit; how can you expand credit in a situation where credit has been manipulated to the hilt? The situation becomes unmanageable, and Franklin Dee V finally asks the islanders to start fishing again, and quickly. The island is back to square one, but in a more disastrous situation, where there are only a few people who know how to fish but MANY who need fish.

The book ends by reiterating that

  • Vote-seeking politicians bring mayhem by interfering in economic systems and tools
  • The minimum wage law is actually counter-productive
  • Consumer credit wastes credit and costs society, in the sense that there is less credit available for commercial purposes

Hence the author suggests a possible solution for the island: the government should outlaw bank credit that finances consumer credit. This would kick off a wave of layoffs and a painful readjustment of lifestyles, but ultimately the island would be better off as everything adjusted downwards. Basically, deflation is the cure suggested to the islanders. However, this seems impossible, because the party that would be hurt most in the process, the government (which has become bankrupt), is the one in charge of passing these strict measures. The book leaves the reader with an obvious question: "Is the island economy doomed, now that it is in this crisis and the government will never take harsh measures?"

Takeaway:

The book was written in 1979, and the situation described in it is similar to what the US is facing now. "Does it take that long for markets, people and trading partners to see that the emperor has no clothes?" is the question this book leaves you with.

[Book cover: Untitled]

This book is a poor imitation of Steven Pressfield's book "Do the Work". The author is a professional actor who seems to have settled into a creative director gig at a church. The book is called "Untitled" and metaphorically refers to the blank page that faces any person at the beginning of a project, be it writing an article, a book or a painting, or creating a business. Everyone has to start with a blank page. In a writer's life, though, every novel and every little story within it begins with a blank page, and the writer must fill in characters and narrative along the way. In that sense, the blank page is a demon that a writer needs to fight every day.

The book is a collection of clichés and quotes picked up from various philosophers, authors and books. Except for the author's suggestion of using one's iTunes library as a launchpad for creative writing, it has nothing really original. Throughout, the author harps on one point: ideation and execution are plain old hard work, and there is no magic bullet. The author, being a creative director and an actor himself, should at least have been a bit more creative in writing some original content rather than merely quoting other people.

[Book cover: The Theory That Would Not Die by Sharon Bertsch McGrayne]

The book by Sharon Bertsch McGrayne is about Bayes' theorem stripped of the math associated with it. In today's world, statistics even at a rudimentary level of analysis (not research, just preliminary analysis) involves forming a prior and improving it based on the data one gets to see; in one sense, modern statistics takes for granted that one starts off with a set of beliefs and improves them with data. When this sort of thinking was first introduced, it was considered something like pseudo-science or maybe voodoo science. During the 1700s, when Bayes' theorem came to everybody's notice, science was considered extremely objective, rational and all the words that go with it, while Bayes was talking about beliefs and improving beliefs based on data. So how did the world come to accept this perspective? In today's world there is not a single domain untouched by Bayesian statistics. In finance and technology specifically, Google's search algorithms, Gmail spam filtering, the Netflix recommendation service, Amazon book recommendations, the Black-Litterman model in finance, and arbitrage models based on Bayesian econometrics are some of the innumerable areas where the Bayesian philosophy is applied. Carol Alexander, in her book on risk management, says the world needs Bayesian risk managers and remarks, "Sadly most of risk management that is done is frequentist in nature".

Pick up any book on Bayes and the first thing you end up reading about is prior and posterior distributions; very little of the history of Bayes is mentioned. It is this void that the book aims to fill, and it does so with a fantastic narrative about the people who rallied for and against a method that took 200 years to get vindicated. Let me summarize the five parts of the book. This is probably the lengthiest post I have ever written for a non-fiction book, the reason being that I expect to refer back to this summary from time to time as I hope to apply Bayes at my workplace.

 

Part I – Enlightenment and Anti-Bayesian Reaction


Causes in the Air

The book starts off by describing the conditions under which Thomas Bayes wrote down the inverse probability problem: deducing causes from effects, deducing the probabilities of parameters given the data, improving prior beliefs based on data. The author speculates that multiple influences may have motivated Bayes to think about the problem: David Hume's philosophical essay, de Moivre's book on chances, the Earl of Stanhope, Isaac Newton's concern that he had not explained the cause of gravity, and so on. Instead of publishing the work or sending it to the Royal Society, Bayes let it lie amidst his mathematical papers. Shortly after his death, his relatives asked Richard Price to look into the various works Bayes had left behind. Price got interested in the inverse probability work, polished it, added various references, made it publication-ready and sent it off to the Royal Society for publication.

 

Thomas Bayes (1701-1761) and Richard Price (1723-1791)

Bayes' theory differs from frequentist theory in that you start with a prior belief about an event, collect data, and then update the prior probability. Frequentists, in contrast, typically do the following:

  • Hypothesize a probability distribution: the random variable of study is hypothesized to have a certain probability distribution. That is, no experiment has been conducted yet, but all the possible realized values of the random variable are known. This step is typically called pre-statistics (David Williams' terminology from his book "Weighing the Odds")
  • One then uses actual data to crystallize the pre-statistic and subsequently reports confidence intervals

This is in contrast with Bayes, where a prior is formed based on one's beliefs and the data is used to update those beliefs. So, in that sense, all the p-values one usually ends up reading about in college texts are useless in the Bayesian world.


The Man who did Everything

Pierre-Simon Laplace (1749-1827)

Bayes' formula and Price's reformulation of his work would have been in vain had it not been for one genius, Pierre-Simon Laplace, the Isaac Newton of France. The book details the life of Laplace, who independently developed the same philosophy and then, on stumbling upon the original work, tried to apply it to everything he could possibly think of. Ironically, Bayes used the methodology to argue for the existence of God, while Laplace used it (in a way) to argue for God's non-existence. In a strange turn of events, at the age of 61 Laplace became a frequentist statistician, as data collection mechanisms improved and the mathematics for dealing with well-measured data was easier in the frequentist approach. Thus the person responsible for giving mathematical life to Bayes' theory turned into a frequentist for the last 16 years of his life. All said and done, Laplace single-handedly created a gazillion concepts, theorems and tricks in statistics that influenced mathematical developments in France; he is credited with the central limit theorem, generating functions, and other terms that are casually tossed around by today's statisticians.


Many Doubts, Few Defenders

Laplace launched a craze for statistics by publishing ratios and numbers about French society, like the number of dead letters in the postal system, the number of thefts, the number of suicides, the proportion of men to women, and declaring that most of them were almost constant. Such numbers increased the French government's appetite for more, and soon there was a plethora of men and women collecting numbers and calling them statistics. No one cared to attribute the effects to particular causes; no one cared whether probability could be used to quantify lack of knowledge. The pantheon of data collectors quickly produced an amazing amount of data, and the notion that probability is based on frequency of occurrence took precedence over any other definition: in modern parlance, the long-run frequency count became synonymous with probability. Also, with large amounts of data the Bayesian and frequentist answers match, so there was all the more reason to push Bayesian statistics, based as it was on subjective beliefs, into oblivion. Objective frequencies of events seemed far more plausible than any mathematics based on belief systems. By 1850, within two generations of his death, Laplace was remembered largely for astronomy; not a single copy of his treatise on probability was available anywhere in Parisian bookstores.

The 1870s, 1880s, and 1890s were dead years for Bayes’ philosophy. This was the third death of Bayes’ principle. The first was when Thomas Bayes did not share his work with anyone and it lay idle amongst his research papers. The second came when, after Price published Bayes’ work in a scientific journal, it was largely ignored; and the third happened by the 1850s, as theoreticians rallied against it.

Precisely during these years, the 1870s to the 1910s, Bayes’ theory was silently being used in real-life applications, and with great success. It was used by French and Russian artillery officers to fire their weapons: there were many uncertain factors in firing artillery, like the enemy’s precise location, air density, and wind direction. Joseph Louis Bertrand, a mathematician in the French army, put Bayes’ to use and published a textbook on artillery firing procedures that was used by the French and Russian armies for the next 60 years, up until the Second World War. Another area of application was telecommunications: an engineer at Bell Labs created a cost-effective way of dealing with uncertainty in call handling based on Bayes’ principles. The US insurance industry was another place where Bayes’ was effectively used. A series of laws passed between 1911 and 1920 obligated employers to provide insurance cover to employees. With hardly any data available, pricing the insurance premiums became a huge problem. Isaac Rubinow provided some relief by manually classifying records from Europe, helping tide over the crisis for two to three years. In the meantime, he created an actuarial society which started using Bayesian philosophy to create the Credibility score, a simple statistic that could be understood, calculated, and communicated easily. This is like implied volatility in the Black-Scholes formula: irrespective of whether anyone understands the Black-Scholes replication argument, one can easily talk, trade, and form opinions based on the implied volatility of an option. Isaac Rubinow helped create one such statistic for the insurance industry, and it remained in vogue for at least 50 years.

Despite the above achievements, Bayes’ was nowhere in prominence. The trio responsible for the near death-knell of Bayes’ principle was Ronald Fisher, Egon Pearson, and Jerzy Neyman.

Ronald Aylmer Fisher, Egon Pearson, Jerzy Neyman

The fight between Fisher and Karl Pearson is a well-known story in the statistical world. Fisher attacked Bayes’ theory, saying it was utterly useless. He was a flamboyant personality and vociferous in his opinions. He single-handedly created a ton of statistical concepts, like randomization and degrees of freedom. Fisher’s most important contribution was that he made statistics available to scientists and researchers who did not have the time for, or training in, statistics; his manual became a convenient recipe for conducting a scientific experiment. This was in marked contrast to Bayes’ principle, where the scientist has to deal with priors and subjective beliefs. During the same period that Fisher was developing frequentist methods, Egon Pearson and Jerzy Neyman introduced hypothesis testing, which helped a ton of people working in labs accept or reject a null hypothesis within a framework where the basic communication of a test result was through p-values. The 1920s and 1930s were the golden age of frequency-based statistics. Considering the enormous influence of these personalities, Bayes’ was nowhere to be seen.

However, far from the enormous visibility of the frequentist statisticians, there was another trio silently advocating and using Bayes’ philosophy: Emile Borel, Frank Ramsey, and Bruno de Finetti.

 

Emile Borel, Frank Ramsey, Bruno de Finetti

Emile Borel was applying Bayes’ probability to problems in insurance, biology, agriculture, and physics. Frank Ramsey was applying the concepts to economics using utility functions. Bruno de Finetti was firmly of the opinion that subjective beliefs could be quantified at the race track. This trio kept Bayes’ alive, albeit in a few circles far removed from the English statistical societies.

Harold Jeffreys

The larger credit, however, goes to the geophysicist Harold Jeffreys, who single-handedly kept Bayes’ alive during the anti-Bayesian onslaught of the 1930s and 1940s. His personality was quiet and gentlemanly, and thus he could coexist with Fisher. Instead of trying to distance himself from Fisherian statistics, Jeffreys was of the opinion that some tools from Fisher’s armory, like maximum likelihood, were equally applicable to Bayesian statistics. However, he was completely against p-values and argued against anyone using them. Statistically, lines were drawn. Jeffreys and Fisher, two otherwise cordial Cambridge professors, embarked on a two-year debate in the Royal Society proceedings. Sadly the debate ended inconclusively, and frequentism totally eclipsed Bayes’. By the 1930s Jeffreys was truly a voice in the wilderness. This becomes the fourth death of Bayes’ theory.

 

Part II – Second World War Era

Bayes’ goes to War

The book goes on to describe the work of Alan Turing, who used Bayesian methods to locate German U-boats and break the Enigma codes in the Second World War.

Alan Turing

His work at Bletchley Park, along with that of other cryptographers, validated Bayesian theory. Though Turing and the others did not use the word Bayesian, almost all of their work was Bayesian in spirit. Once the Second World War came to an end, the group was ordered to destroy all the tools built, manuscripts written, manuals published, etc. Some of the important documents became classified, and none of the Bayesian work could actually be talked about in public; hence Bayesian theory remained in oblivion despite its immense use in World War II.


Dead and Buried Again

With the wartime work locked away as classified information, Bayes’ was dead again. Until the mid 1960s there was not a single article on Bayesian statistics readily available to scientists for their work. Probability was applied only to long sequences of repeatable events and was hence frequentist in nature. Bayes’ theory was ignored by almost everyone, and a statistician during this period meant a frequentist who studied asymptotics and talked in p-values, confidence intervals, randomization, hypothesis testing, etc. Prior and posterior distributions were terms no longer considered useful. Bayes’ theory was dead for the fifth time!

Part III – The Glorious Revival

Arthur Bailey

Arthur Bailey, an insurance actuary, was stunned to see that insurance premium calculations were actually using Bayes’ theorem, something considered heresy in the schools where he had learnt his statistics. He was hell-bent on proving the whole concept flawed and wanted to hoist the flag of frequentist statistics over insurance premium calculations. After struggling for a year, he realized that Bayes’ was the right way to go. From then on, he massively advocated Bayes’ principles in the insurance industry. He was vocal about it, published papers, and let everyone know that Bayes’ was in fact a wonderful tool for pricing insurance premiums. Arthur Bailey’s son Robert Bailey also helped spread Bayesian statistics by using it to rate a host of things. In time, the insurance industry accumulated such enormous amounts of data that Bayes’ rule, like the slide rule, eventually became obsolete there.


From Tool to Theology

Bayes’ stood poised for another of its periodic rebirths as three mathematicians, Jack Good, Leonard Jimmie Savage, and Dennis V. Lindley, tackled the job of turning Bayes’ rule into a respectable form of mathematics and a logically coherent methodology.

 

Jack Good, Jimmie Savage, Dennis V. Lindley

Jack Good, having been Turing’s wartime assistant, knew the power of Bayes’ and started publishing and making Bayes’ theory known to a variety of people in academia. However, much of his work was still classified. Hampered by governmental secrecy and his inability to explain his work, Good remained an independent voice in the Bayesian community. Savage, on the other hand, was instrumental in spreading Bayesian statistics and building a legitimate mathematical framework for analyzing small-data events. He also wrote books with precise mathematical notation and symbols, thus formalizing Bayes’ theory. However, the books were not widely adopted, as the computing machinery needed to implement the ideas was not available. Dennis Lindley, meanwhile, pulled off something remarkable: in Britain he started forming small Bayesian circles and pushing for Bayesian appointments in statistics departments. This feat in itself deserves enormous credit, as he was taking on the mighty Fisher on his own home turf, Britain.

Thanks to Lindley in Britain and Savage in the US, Bayesian theory came of age in the 1960s. The philosophical rationale for using Bayesian methods had been largely settled. It was becoming the only mathematics of uncertainty with an explicit, powerful, and secure foundation in logic. “How to apply it?”, though, remained a controversial question.


Jerome Cornfield, Lung Cancer and Heart Attacks

Jerome Cornfield was the next iconic figure to apply Bayes’ to a real-life problem. He successfully showed the link between smoking and lung cancer, thus introducing Bayes’ to epidemiology. He was probably one of the first to take on Fisher and win the argument. Fisher was adamantly against Bayes’ and had written arguments against Cornfield’s work. Cornfield, with the help of fellow statisticians, showed Fisher’s arguments to be baseless, thus resuscitating Bayes’. Cornfield rose to head the American Statistical Association, and this in turn gave Bayes’ a stamp of legitimacy, in fact a stamp of superiority over frequentist statistics. Or at least it was thought that way.


There’s always a first time

Frequentist statistics, by its very definition, cannot assign probabilities to events that have never happened. “What is the probability of an accidental hydrogen bomb explosion?” This was the question on the mind of Madansky, a PhD student under Savage. As a last resort, he had to adopt Bayes’ theory to answer the question. The chapter shows that Madansky computed posterior probabilities of such an event and published his findings. His findings were taken very seriously, and many precautionary measures were undertaken. The author is of the view that Bayes’ thus averted many cataclysmic disasters.


46656 Varieties

There was an explosion in Bayesian statistics around the 1960s, and Jack Good estimated that there were 46,656 different varieties of Bayesians in circulation. One can see that statisticians were grappling with Bayesian statistics, as it was not as straightforward as the frequentist world. Stein’s paradox created further confusion: Stein was an anti-Bayesian, yet his shrinkage estimator smelled like a Bayes formula, and its creator did not credit Bayes for it. There were also quite a few practical problems in computing within the Bayesian framework, the most important being the integration of complex densities. This was the 1960s, and computing power eluded statisticians. Most researchers had to come up with tricks and nifty solutions to tide over laborious calculations. Conjugacy was one such concept introduced to make things tractable: conjugacy is the property that the prior and the posterior belong to the same family of distributions, which makes estimation and inference a little easier (see the sketch below). The 1960s and 1970s were times when academic interest in Bayesian statistics was in full swing. Periodicals, journals, and societies started forming to work on, espouse, and test various Bayesian concepts. The extraordinary fact about the glorious Bayesian revival of the 1950s and 1960s is how few people in any field publicly applied Bayesian theory to real-world problems. As a result, much of the speculation about Bayes’ rule was moot. Until they could prove in public that their method was superior, Bayesians were stymied.
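
Here is a hedged sketch of the shortcut conjugacy buys (my own illustrative example; the helper name beta_binomial_update is mine): with a Beta prior on a coin’s heads-probability and binomial data, the posterior is again a Beta, so the update is just count addition and posterior summaries have closed forms, with no integration needed.

    # Conjugacy sketch (illustrative only): Beta prior + binomial likelihood -> Beta posterior.

    def beta_binomial_update(a, b, heads, tails):
        """Return the parameters of the posterior Beta(a + heads, b + tails)."""
        return a + heads, b + tails

    a, b = 2.0, 2.0                                   # prior Beta(2, 2), mildly centered at 0.5
    a_post, b_post = beta_binomial_update(a, b, heads=7, tails=3)

    posterior_mean = a_post / (a_post + b_post)       # closed-form summary, no integral required
    print(a_post, b_post, round(posterior_mean, 3))   # 9.0 5.0 0.643

This is exactly the kind of pencil-and-paper tractability that mattered before cheap computing arrived.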

Part IV – To Prove its Worth

Business Decisions

Even though there were a ton of Bayesians around in the 1960s, there were hardly any real-life examples where Bayes’ had been applied. Even for simple problems, computer simulation was a must. Hence, devoid of computing power and practical applications, Bayes’ was still considered heresy. There were a few who were gutsy enough to spend their lives changing this situation. This chapter talks about two such individuals:

Osher Schlaifer and Howard Raiffa

These two Harvard professors set out on a journey to apply Bayes’ to business statistics. Osher Schlaifer was a faculty member in the accounting and business department who was assigned, almost at random, to teach a statistics course. Knowing nothing about it, Schlaifer crammed the frequentist material and subsequently wondered about its utility in real life. Slowly he realized that in business one always starts with a prior and develops better probabilities based on the data one gets to see in the real world. This logically brought him to the Bayesian world. The math was demanding, and Schlaifer immersed himself in it to understand every bit. He also came to know about a young professor at Columbia, Howard Raiffa, pitched Harvard to recruit him, and brought him over. For about seven years the two created a lot of practical applications of Bayesian methods. They realized there was no mathematical toolbox for Bayesians to work with and went on to develop many concepts that could make Bayesian computations easier. They also published books on applications of Bayesian statistics to management, detailed notes on Markov chains, etc., and tried to bring Bayesian material into the curriculum. Despite these efforts, Bayes’ could not really permeate academia, for various reasons mentioned in this chapter. It is also mentioned that some students burnt their lecture notes in front of the professors’ office to vent their frustration.


Who Wrote the Federalist?

Frederick (Fred) Mosteller

What is the Federalist puzzle? Between 1787 and 1788, three founding fathers of the United States, Alexander Hamilton, John Jay, and James Madison, anonymously wrote 85 newspaper articles to persuade New York State voters to ratify the American Constitution. Historians could attribute most of the essays, but no one agreed on whether Madison or Hamilton had written 12 others. Frederick Mosteller at Harvard was excited by this classification problem and chose to immerse himself in it for 10 years, roping in David Wallace from the University of Chicago for the massive research exercise. To this day, the Federalist study is quoted as THE case study to refer to for the technical armory and depth of Bayesian statistics. Even though the puzzle persists, it is the work that went into it that showed the world that Bayesian statistics was powerful enough to analyze hard, real-life problems. One of the conclusions from the work was that the prior did not matter much; it was the learning from the data that mattered. Hence it served as a good reminder to all the people who criticized Bayes’ philosophy for its reliance on subjective priors. As an aside, the chapter brings out interesting elements of Frederick Mosteller’s life, and there is a lot that is inspiring in the way he worked. The life of Mosteller is a must-read for every statistician.


The Cold Warrior

John Tukey

This chapter talks about another missed opportunity for Bayes’ to become popular. The section profiles Tukey, sometimes referred to as the Picasso of statistics. He seems to have used Bayesian statistics extensively in his military work, his election-prediction work, etc. However, he kept his work confidential, partly because he was obliged to and partly by choice. The author gives enough pointers for the reader to infer that Tukey was an out-and-out Bayesian.

Three Mile Island

Despite its utility in the real world, 200 years on people were still avoiding the word Bayes’. Even though everyone used the method in some form or other, no one openly declared it. This changed after the near-disaster at the Three Mile Island civilian nuclear power plant, an event rated extremely remote by frequentist analyses but anticipated by Bayesian ones. This failure helped establish that holding subjective beliefs and working out posterior probabilities is not heresy.

The Navy Searches

This is one of the longest chapters in the book, and it talks about the ways in which the US Navy put Bayesian methods to use to locate lost objects like hydrogen bombs, submarines, etc. In the beginning, even though Bayes’ was used, no one was ready to publicize it; Bayes’ was still a taboo word. Slowly, as Bayes’ was used more and more effectively to locate things lying on the ocean floor, it started to get enormous recognition in navy circles. Subsequently Bayes’ was used to track moving objects, which also proved very successful. For the first time, Monte Carlo methods were being used extensively to calculate posterior probabilities, at least 20 years before they excited academia. Also, cute tricks like conjugacy were quickly incorporated to calculate summaries of posterior probabilities.

Part V – Victory

Eureka

There were five near-fatal blows to Bayes’ theory: Bayes himself had shelved it; Price published it but it was ignored; Laplace discovered his own version but later favored frequency theory; frequentists virtually banned it; and the military kept it secret. By the 1980s data was being generated at an enormous rate in different domains, and statisticians were faced with the “curse of dimensionality” problem. Fisherian and Pearsonian methods, applicable to data with a few variables, felt inadequate given the explosive growth in the number of variables. How to separate the signal from the noise became a vexing problem. Naturally, the data sets analyzed in academia were the ones with fewer variables. In contrast, physical and biological scientists analyzed massive amounts of data about plate tectonics, pulsars, evolutionary biology, pollution, the environment, economics, health, education, and social science. Thus Bayes’ was stuck again because of its complexity.

In such a situation, Lindley and his student Adrian F. Smith showed Bayesians the way forward by developing hierarchical models.

Adrian F. Smith

The hierarchical models initially fell flat, as they were too specialized and stylized for many scientific applications. (It would be another 20 years before students began their introduction to Bayes’ by looking into hierarchical models.) Meanwhile, Adrian Raftery studied coal-mining accident rates using Bayesian statistics and was thrilled to publicize one of Bayes’ strengths, namely evaluating competing models or hypotheses. Frequency-based statistics works well when one hypothesis is a special case of the other and both assume gradual behavior. But when hypotheses are competing and neither is a special case of the other, frequentism is not as helpful, especially with data involving abrupt changes, like the formation of a militant union.

During the same period, 1985-1990, image processing and analysis had become critically important for the military, industrial automation, and medical diagnosis. Blurry, distorted, imperfect images were coming from military aircraft, infrared sensors, ultrasound machines, photon emission tomography, magnetic resonance imaging (MRI) machines, electron micrographs, and astronomical telescopes. All these images needed signal processing, noise removal, and deblurring to make them recognizable. All were inverse problems ripe for Bayesian analysis. The first known attempt to use Bayes’ to process and restore images involved nuclear weapons testing at Los Alamos National Laboratory: Bobby R. Hunt suggested Bayes’ to the laboratory and used it there in 1973 and 1974. The work was classified, but during this period he and Harry C. Andrews wrote a book, “Digital Image Restoration”, about the basic methodology. Thus there was immense interest in image pattern recognition during these years.

Stuart Geman and Donald Geman became enthused about applying Bayes’ to pattern recognition after attending a seminar. The two went on to invent the technique now known as Gibbs sampling.

Stuart Geman and Donald Geman

These were the first signs of the development of computational techniques for Bayes’. It was Smith who teamed up with Alan Gelfand and turned up the heat on developing computational techniques. OK, now a bit about Alan Gelfand.

Alan Gelfand

Gelfand started working on the EM (Expectation-Maximization) algorithm. Well, the first time I came across the EM algorithm was in Casella and Berger’s Statistical Inference, where it was very well explained. However, there is one thing missing in such books that can only be obtained by reading around the subject: the historical background. This chapter mentions a tidbit about EM that will make learning it more interesting. It says:

EM algorithm, an iterative system secretly developed by the National Security Agency during the Second World War or the early Cold War. Arthur Dempster and his student Nan Laird at Harvard discovered EM independently a generation later and published it for civilian use in 1977. Like the Gibbs sampler, the EM algorithm worked iteratively to turn a small data sample into estimates likely to be true for an entire population.
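
To give a flavour of the iterative idea described in that passage, here is a minimal, hedged EM sketch for a two-component Gaussian mixture (my own illustrative code, not the formulation from the book; the helper normal_pdf and all parameter names are mine): the E-step computes each component’s responsibility for every data point from the current parameter guesses, and the M-step re-estimates the parameters from those responsibilities.

    # Minimal EM sketch for a mixture of two 1-D Gaussians (illustrative only).
    import random, math

    random.seed(0)
    # synthetic data: two clusters centered around -2 and +3
    data = [random.gauss(-2, 1) for _ in range(200)] + [random.gauss(3, 1) for _ in range(200)]

    def normal_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # initial guesses for mixture weights, means, and standard deviations
    w, mu, sigma = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

    for _ in range(50):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and standard deviations
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk)

    print([round(m, 2) for m in mu])   # estimated means end up close to -2 and 3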

Gelfand and Smith saw the connection between the Gibbs sampler, Markov chains, and Bayes’ methodology. They recognized that Bayesian formulations, especially the integrations involved, could be successfully handled using a mix of Markov chains and simulation techniques. When Smith spoke at a workshop in Quebec in June 1989, he showed that Markov chain Monte Carlo could be applied to almost any statistical problem. It was a revelation. Bayesians went into “shock induced by the sheer breadth of the method.” By replacing integration with Markov chains, they could finally, after 250 years, work with realistic priors and likelihood functions and do the difficult calculations needed to get posterior probabilities.
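
To show the kind of computation they had in mind, here is a hedged, minimal Gibbs-sampler sketch (my own toy example, not code from Gelfand and Smith): for a bivariate normal with correlation rho, each coordinate is drawn in turn from its conditional distribution given the other, and the chain’s draws approximate the joint distribution without any explicit integration.

    # Minimal Gibbs sampler sketch (illustrative only): sample a bivariate normal
    # with correlation rho by alternating the two conditional distributions.
    import random

    random.seed(1)
    rho = 0.8
    x, y = 0.0, 0.0
    draws = []

    for i in range(20000):
        # conditional of x given y is Normal(rho*y, 1 - rho^2); likewise for y given x
        x = random.gauss(rho * y, (1 - rho ** 2) ** 0.5)
        y = random.gauss(rho * x, (1 - rho ** 2) ** 0.5)
        if i > 2000:                       # discard burn-in draws
            draws.append((x, y))

    n = len(draws)
    corr_hat = sum(a * b for a, b in draws) / n   # both margins have mean 0 and variance 1
    print(round(corr_hat, 2))                      # close to rho = 0.8

The same alternating-conditional trick is what lets a Bayesian sample a high-dimensional posterior one manageable piece at a time.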

While these developments were happening in statistics, an outsider might feel a little surprised at how slowly Monte Carlo methods made their way into statistics, given that physicists had been using them from as early as the 1930s. Fermi, the Nobel prize-winning physicist, used Markov chains to study nuclear physics but could not publish the work, as it was classified. However, in 1949 Maria Goeppert Mayer, a physicist and future Nobel laureate, gave a public talk on Markov chains plus simulation and their application to real-world problems.

Nicholas Metropolis

That same year Nicholas Metropolis, who had named the algorithm Monte Carlo after Ulam’s gambling uncle, described the method in general terms for statisticians in the prestigious Journal of the American Statistical Association. But the algorithm’s modern form was not detailed until 1953, when his article appeared in the Journal of Chemical Physics, a journal generally found only in physics and chemistry libraries. Bayesians ignored the paper. W. K. Hastings generalized the algorithm in 1970, which is why it also carries his name. Today, computers routinely use the Metropolis-Hastings algorithm to work on problems involving more than 500,000 hypotheses and thousands of parallel inference problems. Hastings was 20 years ahead of his time. Had he published his paper when powerful computers were widely available, his career would have been very different. As he recalled, “A lot of statisticians were not oriented toward computing. They take these theoretical courses, crank out theoretical papers, and some of them want an exact answer.” The Metropolis-Hastings algorithm provides estimates, not precise numbers. Hastings dropped out of research and settled at the University of Victoria in British Columbia in 1971, and he learned about the importance of his work only after his retirement in 1992.
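
Here is a hedged sketch of the random-walk Metropolis step behind all of this (my own toy example, not Metropolis’s or Hastings’s original code; the helper unnormalized_posterior is mine): a candidate is proposed near the current value and accepted with probability equal to the ratio of unnormalized posterior densities, so only the shape of the posterior, never its normalizing constant, is needed.

    # Minimal random-walk Metropolis sketch (illustrative only): sample from a
    # distribution known only up to a normalizing constant.
    import random, math

    random.seed(2)

    def unnormalized_posterior(theta):
        # a standard normal shape; the 1/sqrt(2*pi) constant is deliberately omitted
        return math.exp(-0.5 * theta ** 2)

    theta = 0.0
    samples = []
    for i in range(50000):
        proposal = theta + random.gauss(0, 1)          # symmetric random-walk proposal
        accept_prob = min(1.0, unnormalized_posterior(proposal) / unnormalized_posterior(theta))
        if random.random() < accept_prob:
            theta = proposal                           # accept the move, else stay put
        if i > 5000:
            samples.append(theta)                      # keep post-burn-in draws

    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(round(mean, 2), round(var, 2))               # roughly 0.0 and 1.0, as expected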

Gelfand and Smith published their synthesis just as cheap, high-speed desktop computers finally became powerful enough to house large software packages that could explore relationships between different variables. Bayes’ was beginning to look like a theory in want of a computer. The computations that had irritated Laplace in the 1780s and that frequentists avoided with their variable-scarce data sets seemed to be the problem—not the theory itself.

The papers published by Gelfand and Smith were considered landmarks in the field of statistics. They showed that Markov chain simulation applied to Bayes’ could basically solve any frequentist problem and, more importantly, many other problems. The method was baptized Markov chain Monte Carlo, or MCMC for short. The combination of Bayes and MCMC has been called “arguably the most powerful mechanism ever created for processing data and knowledge.”

After MCMC’s birth around 1990, statisticians could study data sets in genomics or climatology and build models far bigger than physicists could ever have imagined when they first developed Monte Carlo methods. For the first time, Bayesians did not have to oversimplify with “toy” assumptions. Over the next decade, the most heavily cited paper in the mathematical sciences was a study of practical Bayesian applications in genetics, sports, ecology, sociology, and psychology. The number of publications using MCMC increased exponentially.

Almost instantaneously MCMC and Gibbs sampling changed statisticians’ entire method of attacking problems. In the words of Thomas Kuhn, it was a paradigm shift. MCMC solved real problems, used computer algorithms instead of theorems, and led statisticians and scientists into a world where “exact” meant “simulated” and repetitive computer operations replaced mathematical equations. It was a quantum leap in statistics.

Appreciating the people behind the techniques is important in any field. The takeaway from this chapter is that credit for MCMC’s birth goes to

Adrian F. Smith + Alan Gelfand

Bayes’ and MCMC found applications in genetics and a host of other domains. MCMC took off, and Bayes’ was finally vindicated. One aspect that still troubled people at large was the availability of software to carry out the computations. MCMC, the Gibbs sampler, and Metropolis-Hastings were all amazingly good concepts, but they needed general-purpose software that anyone could use. The story gets more interesting with the contribution of David Spiegelhalter.

David Spiegelhalter

Smith’s student David Spiegelhalter, working at the Medical Research Council’s biostatistics unit in Cambridge, had a rather different point of view about using Bayes’ for computer simulations. Statisticians had never considered producing software for others to be part of their jobs. But Spiegelhalter, influenced by computer science and artificial intelligence, decided it was part of his. In 1989 he started developing a generic software program for anyone who wanted to use graphical models for simulations. Spiegelhalter unveiled his free, off-the-shelf BUGS program (short for Bayesian inference Using Gibbs Sampling) in 1991.

Ecologists, sociologists, and geologists quickly adopted BUGS and its variants, WinBUGS for Microsoft users, LinBUGS for Linux, and OpenBUGS. Computer science, machine learning, and artificial intelligence also joyfully swallowed up BUGS. Since then it has been applied to disease mapping, pharmacometrics, ecology, health economics, genetics, archaeology, psychometrics, coastal engineering, educational performance, behavioral studies, econometrics, automated music transcription, sports modeling, fisheries stock assessment, and actuarial science.

A few more examples of Bayes’ formal adoption mentioned in the book are:

  • The Food and Drug Administration (FDA) allows manufacturers of medical devices to use Bayes’ in their final applications for FDA approval.
  • Drug companies use WinBUGS extensively when submitting their pharmaceuticals for reimbursement by the English National Health Service.
  • The Wildlife Protection Act was amended to accept Bayesian analyses alerting conservationists early to the need for more data.
  • Today many fisheries journals demand Bayesian analyses.
  • The Forensic Science Service in Britain uses Bayes’.

 

Rosetta Stones

The last chapter recounts the various applications in which Bayes’ has been used successfully. This chapter alone will motivate anyone to keep a Bayesian mindset while solving real-life problems with statistics.

Bayes has broadened to the point where it overlaps computer science, machine learning, and artificial intelligence. It is empowered by techniques developed both by Bayesian enthusiasts during their decades in exile and by agnostics from the recent computer revolution. It allows its users to assess uncertainties when hundreds or thousands of theoretical models are considered; combine imperfect evidence from multiple sources and make compromises between models and data; deal with computationally intensive data analysis and machine learning; and, as if by magic, find patterns or systematic structures deeply hidden within a welter of observations. It has spread far beyond the confines of mathematics and statistics into high finance, astronomy, physics, genetics, imaging and robotics, the military and antiterrorism, Internet communication and commerce, speech recognition, and machine translation. It has even become a guide to new theories about learning and a metaphor for the workings of the human brain.

Some of the interesting points mentioned in this chapter are:

  • In this ecumenical atmosphere, two longtime opponents—Bayes’ rule and Fisher’s likelihood approach—ended their cold war and, in a grand synthesis, supported a revolution in modeling. Many of the newer practical applications of statistical methods are the results of this truce.
  • Bradley Efron, the man behind bootstrapping, admitted that he had always been a Bayesian.
  • Mathematical game theorists John C. Harsanyi and John Nash shared a Bayesian Nobel in 1994.
  • Amos Tversky, whose longtime collaborator Daniel Kahneman won the 2002 Nobel Prize, thought through Bayesian methods, though he reported the findings in frequentist terms
  • Crash courses in Bayesian concepts are being offered in Economics depts. at all Ivy League schools
  • Renaissance Technologies uses Bayesian approaches heavily for portfolio management and technical trading. Portfolio manager Robert L. Mercer states: “RenTec gets a trillion bytes of data a day, from newspapers, AP wire, all the trades, quotes, weather reports, energy reports, government reports, all with the goal of trying to figure out what’s going to be the price of something or other at every point in the future. We want to know in three seconds, three days, three weeks, three months. The information we have today is a garbled version of what the price is going to be next week. People don’t really grasp how noisy the market is. It’s very hard to find information, but it is there, and in some cases it’s been there for a long, long time. It’s very close to science’s needle-in-a-haystack problem.”
  • Bayes’ has found a comfortable niche in high-energy astrophysics, x-ray astronomy, gamma ray astronomy, cosmic ray astronomy, neutrino astrophysics, and image analysis.
  • Biologists who study genetic variation are limited to tiny snippets of information and find that Bayes’ is manna from heaven for such problems
  • Sebastian Thrun of Stanford built a driverless car named Stanley. The Defense Advanced Research Projects Agency (DARPA) staged a contest with a $2 million prize for the best driverless car; the military wants to employ robots instead of manned vehicles in combat. In a watershed for robotics, Stanley won the competition in 2005 by crossing 132 miles of Nevada desert in seven hours.
  • On the Internet Bayes has worked its way into the very fiber of modern life. It helps to filter out spam; sell songs, books, and films; search for web sites; translate foreign languages; and recognize spoken words.
  • The $1 million contest sponsored by Netflix.com illustrates the prominent role of Bayesian concepts in modern e-commerce and learning theory.
  • Google also uses Bayesian techniques to classify spam and pornography and to find related words, phrases, and documents.
  • The blue ribbons Google won in 2005 in a machine translation contest sponsored by the National Institute of Standards and Technology showed that progress was coming not from better algorithms but from more training data. Computers don’t “understand” anything, but they do recognize patterns. By 2009 Google was providing online translations in dozens of languages, including English, Albanian, Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, Estonian, Filipino, Finnish, and French.

What’s the future of Bayes’? In the author’s words:

Given Bayes’ contentious past and prolific contributions, what will the future look like? The approach has already proved its worth by advancing science and technology from high finance to e-commerce, from sociology to machine learning, and from astronomy to neurophysiology. It is the fundamental expression for how we think and view our world. Its mathematical simplicity and elegance continue to capture the imagination of its users.

The overall message of the last chapter is that Bayes’ is just getting started and will have a tremendous influence in the times to come.

 

Takeaway:

Persi Diaconis, a Bayesian at Stanford, says, “Twenty-five years ago, we used to sit around and wonder, ‘When will our time come?’ Now we can say: ‘Our time is now.’”

Bayes’ is used in many problem areas, and this book provides a fantastic historical narrative of the birth-and-death process that Bayes’ went through before it was finally vindicated. The author’s sketches of various statisticians make the book an extremely interesting read.