May 2013




Right from the preface of the book, Prof. David Williams emphasizes that intuition is more important than rigour. The definition of probability in terms of long-term frequency is fatally flawed, and hence the author makes it very clear in the preface that “probability works only if we do not define probability in the way we talk about probability in the real world”. In other words, colloquial references to probability give rise to shaky foundations. However, if you build up probability theory axiomatically, then the whole subject is as rigorous as group theory. Statistics in the modern era is vastly different from that of yesteryear. Computers have revolutionized the application of statistics to real-life problems. Most of the modern problems are solved by applying Bayes’ formula via MCMC packages. If this statement is surprising to you, then you should definitely read this book. Gone are the days when statisticians used to refer to some table of distributions, p-values etc. to talk about their analysis. In fact, in the entire book of 500-odd pages, there are only about 15 pages of content on hypothesis testing, and that too under the title “Hypothesis testing, if you must”. Today one of the critical ingredients of a statistician’s toolbox is MCMC. Ok, let me attempt to summarize this book.

Chapter 1: Introduction

The author states that the purpose of the book is only to provide sufficient of a link between Probability and Statistics to enable the reader to go over advanced books on Statistics, and to show also that “Probability in its own right is both fascinating and very important”. The first chapter starts off by stating the two most common errors made in studying the subject. The first common mistake, when it comes to actually doing calculations, is to assume independence. The other common error is to assume that things are equally likely when they are not. The chapter comprises a set of examples and exercises that show the paradoxical nature of probability and the mistakes we can make if we are not careful. The Two Envelopes Paradox illuminates the point that a measure-theoretic view is critical to understanding the subject. The idea of probability as a long-term relative frequency is debunked with a few examples that motivate the need for measure theory.

This chapter is a charm, as it encourages readers to embrace measure theory results in a manner that helps intuition. It says that by taking a few results from measure theory for granted, we can clearly understand the subject and its theorems. The author says that “Experience shows that mathematical models based on our axiomatic system can be very useful models for everyday things in the real world. Building a model for a particular real-world phenomenon requires careful consideration of what independence properties we should insist that our model have”.

One of the highlights of the book is that it explains the linkage between probability and statistics in an extremely clear way. I have not come across such a lucid explanation of this linkage anywhere before. Let me paraphrase the author here,

In Probability, we do in effect consider an experiment before it is performed. Numbers to be observed or calculated from observations are at that stage Random Variables to be observed or calculated from observations – what I shall call Pre-Statistics, rather nebulous mathematical things. We deduce the probability of various outcomes of the experiment in terms of certain basic parameters.

In Statistics, we have to infer things about the values of the parameters from the observed outcomes of an experiment already performed. The Pre-Statistics have now been crystallized into actual statistics, the observed number or numbers calculated from them.

We can decide whether or not operations on actual statistics are sensible only by considering probabilities associated with the Pre-Statistics from which they crystallize. This is the fundamental connection between Probability and Statistics.

The chapter ends with the author reiterating the importance of computers in solving statistical problems. Analytical closed forms are not always what one wants, and sometimes they are not even possible, so one has to resort to simulation techniques. Maybe 20 years ago it was ok to somehow do statistics without much exposure to statistical software; not any more. Computers have played a fundamental role in the spread and understanding of statistics. Look at the rise of R and you come to the inevitable conclusion that “statistical software skills are becoming as important as modeling skills”.

Some of the quotes by the author that I found interesting are

  • The best way to learn how to program is to read programs, not books on programming languages.
  • In regard to Statistics, common sense and scientific insight remain, as they have always been, more important than Mathematics. But this must never be made an excuse for not using the right Mathematics when it is available.
  • Intuition is more important than rigor, though we do need to know how to back up intuition with rigor.
  • Our intuition finds it very hard to cope with sometimes perverse behavior of ratios.
  • The likelihood geometry of the normal distribution, the most elegant part of Statistics, has been used to great effect, and remains an essential part of Statistical culture. What was a good guide in the past does not lose that property because of advances in computing. But, modern computational techniques mean that we need no longer make the assumption of normally distributed errors to get answers. The new flexibility is exhilarating, but needs to be used wisely.

Chapter 2: Events and Probabilities

The second chapter gives a whirlwind tour of measure theory. After stating some basic rules of probability, like the addition rule and the inclusion-exclusion principle, the chapter goes on to define all the important terms and theorems of measure theory. This chapter is not a replacement for a thorough understanding of measure theory. However, if one does not want to plough through measure theory AT ALL, but wants to have just enough of an idea to get going, then this chapter is useful. For readers who already know measure theory, this chapter serves as a good recap of the parts of the subject that are necessary to understand statistics. The concepts introduced and stated in this chapter are Monotone Convergence Properties, Almost surely, Null sets, Null events, the Borel-Cantelli Lemma, Borel sets and functions, and the Pi-System Lemma. A good knowledge of these terms is priceless in understanding stats properly. The Banach-Tarski Paradox is also stated so that one gets a good idea of the reason for leaving non-measurable sets out of the event space.

Chapter 3: Random Variables, Means and Variances

The tone of the book is sometimes funny. The author at the beginning of the chapter says

This chapter will teach you nothing about probability and statistics. All the methods described for calculating means and variances are dull and boring, and there are better, more efficient indirect methods described in later chapters of the book.

Random variables are basically measurable functions, and the reason for choosing Lebesgue/Borel measurable functions is that it is hard to escape from the measurable world when you add, subtract, multiply or take limits of these functions. The author introduces a specific terminology in the book that helps the reader to understand the linkage between Statistics and Probability. A Pre-Statistic is a special kind of random variable: a random variable Y is a Pre-Statistic if the value Y(w_actual) will be known to the observer after the experiment is performed. Y(w_actual) becomes the observed value of Y. An example mentioned in the book brings out the difference:

Let’s say you want to know the width of an object and you have 100 different observations of that width. You can calculate the mean and the residual sum of squares of the observations. But to make sense of these crystallized numbers, one needs to study the Pre-Statistics M and RSS and know their distributions, so that one can compute confidence intervals. So the Pre-Statistics are important and are a part of Probability. The inference about the parameters forms a part of Statistics.

The chapter then talks about distribution functions, probability mass functions and probability density functions. Expectation of random variables is introduced via simple random variables and non-negative random variables. I really liked the discussion of the supremum that connects the expectation of a general non-negative random variable to that of simple non-negative random variables. By going through the discussion, I am able to verbalize the ideas in a better way. Wherever measure theory is used, it is flagged with the letter M so that a reader who is not interested can skip the content. However, it is very important for any student of statistics to know the connection between integration and expectation. The Lebesgue integral of a non-negative measurable function is defined as the supremum of the integrals of the simple functions lying below it. This is where the connection to integration appears.

The chapter quickly introduces L1 space and L2 space and shows how various random variables fit in. The mean of a random variable can be computed if the random variable lies in L1, whereas variance makes sense for a variable in L2. On a probability space L2 is a subset of L1 (a finite variance forces a finite mean), and facts of this kind become clear after having some knowledge of functional analysis, normed vector spaces, metric spaces, etc. So my feeling is that a reader who has never been exposed to spaces such as Hilbert spaces would fail to appreciate the various aspects mentioned in this chapter. There is no alternative but to slog through the math from some other book to understand these concepts. However, this chapter serves as a wonderful recap of function spaces. The fact that a random variable has to lie in L2 for its variance, and hence its standardized version, to make sense needs to be understood one way or the other, be it through slog or through the right kind of intuition.

Chapter 4: Conditioning and Independence

The chapter starts with these words

Hallelujah! At last, things can liven up.

The author is clearly kicked about the subject and more so about conditioning. The author’s enthusiasm might be infectious too. At this point in the book, any reader would start looking forward to the wonderful world of conditioning.

The chapter introduces the basic definition of conditional probability, states Bayes’ theorem, and gives the famous Polya’s urn example to illustrate the use of conditioning. Polya’s urn is described as “a thing of beauty and joy forever”, meaning that the Polya’s urn model is very rich mathematically. Indeed, there are books on probability models where all the models have Polya’s urn scheme as the spine. Look at the book Polya Urn Models, where innumerable models are discussed with Polya’s urn in the background. Simple examples of Bayesian estimation are followed up with a rather involved example of Bayesian change point detection.

I think the section on genetics and the application of conditional probability has been the biggest takeaway from this chapter and probably from the entire book. I am kind of ashamed to write here that I had not read any books by Richard Dawkins, one of the greatest living scientific writers. Sometimes I am amazed at my ignorance of various things in life, and then I am reminded of the innumerable activities that I had wasted my time on in the past. Well, at least I am glad that I now recognize the value of time in a better way. Anyway, the author strongly urges the reader to go over at least a few books by Dr Richard Dawkins. Thanks to the torrent sites, all the books by Richard Dawkins are available for free. I have downloaded a couple and have put them on my reading list. At least I should read “The Selfish Gene” and “The Blind Watchmaker” soon. Apart from the fantastic theory, there is tremendous scope to see the application of conditional probability to the field of genetics. I hope to understand these aspects as I keep working on honing my probability skills.

Ok, coming back to the summary, the chapter then introduces the concept of independence and the Borel-Cantelli Lemma, and shows through an example the connection between lim sup and the Borel-Cantelli Lemma. There are some exercises in the chapter that are challenging. The “Five Nation problem” is something that I found interesting. It goes like this: if 5 nations play a sport against each other, i.e. 10 games in total, and every game is a toss-up with no draws, what is the probability that each nation wins exactly 2 games? The answer is 3/128 and it is not immediately obvious. One needs to think a bit to get to the solution. I first worked it out using a simulation and then solved it analytically.
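The 3/128 figure can be checked by brute force rather than by simulation. The sketch below is my own check, not the book's solution: it assumes every game is an independent fair coin toss with no draws, and simply enumerates all 2^10 outcomes. The function name is made up for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def prob_all_win_two(n_teams=5):
    """Exact probability that every team wins the same number of games in a
    round robin where each game is an independent fair coin toss (assumption)."""
    games = list(combinations(range(n_teams), 2))   # 10 games for 5 teams
    target = (n_teams - 1) // 2                     # 2 wins each for 5 teams
    favourable = 0
    for outcome in product((0, 1), repeat=len(games)):
        wins = [0] * n_teams
        for (i, j), first_wins in zip(games, outcome):
            wins[i if first_wins else j] += 1       # credit the winner of this game
        if all(w == target for w in wins):
            favourable += 1
    return Fraction(favourable, 2 ** len(games))

print(prob_all_win_two())  # 3/128
```

Only 24 of the 1024 equally likely outcomes give every nation exactly two wins, which is where 24/1024 = 3/128 comes from.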

I knew the way to calculate the pdf of order statistics using a rather tedious approach: crank out the cdf of the order statistic of IID variables, differentiate, and show that for uniform variables the order statistic follows a beta distribution. Williams explores the same statistic and derives the result intuitively but rigorously, without resorting to calculus. I had persevered to understand the calculus approach earlier, and hence this intuitive approach was a pleasant surprise that I will forever remember. The distribution of the median of a uniform random sample is calculated and it is immediately followed up with an example. This is the recurrent pattern of the book. Whenever you see some statistic being studied, in this case the median of n IID uniform rvs, it is followed up with a use of that statistic. As one is aware, the Cauchy distribution has no mean. The sample mean of N Cauchy variables will not help you estimate the location parameter. It is the median that comes to your rescue. The sample median is likely to be close to the location parameter of the Cauchy distribution.

The section on the Law of Large Numbers has an interesting example called the Car Convoy problem, which shows that a simple probability model can have a sophisticated mathematical underpinning. This example is a great application of the Borel-Cantelli Lemma, and one I hope to remember forever. Basic inequalities like Tchebyshev’s and Markov’s are introduced, the former being used in the derivation of the weak law of large numbers.
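A quick simulation makes the Cauchy point concrete. This is a minimal sketch of my own, not code from the book; the location value 3.0, the seed and the helper name are arbitrary choices for illustration.

```python
import math
import random
import statistics

def cauchy_sample(n, location=0.0, rng=random):
    # Inverse-CDF draw: tan(pi * (U - 1/2)) is standard Cauchy; add the location shift.
    return [location + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

rng = random.Random(2013)
theta = 3.0   # hypothetical true location parameter
for n in (101, 1001, 10001):
    ys = cauchy_sample(n, theta, rng)
    # The median settles near theta as n grows; the mean keeps jumping around.
    print(n, round(statistics.median(ys), 3), round(statistics.fmean(ys), 3))
```

Rerunning with different seeds shows the same pattern: the median column tightens around the location while the mean column does not improve with sample size.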

Every experiment, no matter how complicated it is, can be reduced to choosing a number from the uniform distribution on (0,1). This fact is shown intuitively using Borel’s Strong Law for coin tossing. Kolmogorov’s Strong Law of Large Numbers is stated and proved in an intuitive manner with less frightening measure-theoretic math. In the end it is amply clear that the SLLN deals with almost sure convergence and the WLLN deals with convergence in probability. This is where things tie in. In an infinite coin tossing experiment it is the SLLN that is invoked to show that the proportion of heads almost surely converges to ½: there are many sequences for which the proportion does not converge to ½, but the set of all such sequences has measure 0. Convergence in probability is weaker: it only says that, for each large n, the average is close to the mean with probability close to 1, and this by itself does not guarantee that the averages along a realized sequence converge to the mean. The author then states the “Fundamental Theorem of Statistics” (Glivenko-Cantelli), which talks about the convergence of the sample distribution function to the population distribution function. Basically this is the reason why bootstrapping works: you compute the ecdf and work with it for all sorts of estimation and inference. The section also mentions a few examples which show that the weak law is indeed weak, meaning that the running average can keep straying significantly from the mean along a sequence while still converging in probability to the mean.
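The Glivenko-Cantelli statement is easy to watch numerically. The sketch below is my own illustration: it uses Uniform(0,1) samples, for which the true distribution function is F(x) = x, and computes the largest gap between the ecdf and F. The function name and sample sizes are assumptions made for the example.

```python
import random

def sup_distance_uniform(sample):
    """Sup over x of |F_n(x) - x| for a sample assumed to come from Uniform(0,1)."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at a jump of the ecdf, so check both sides of each jump.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
for n in (100, 10_000, 100_000):
    d = sup_distance_uniform([rng.random() for _ in range(n)])
    print(n, round(d, 4))   # the gap shrinks roughly like 1 / sqrt(n)
```

The shrinking gap is what lets the bootstrap treat the ecdf as a stand-in for the unknown population distribution.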

The chapter then introduces Simple Random Walks. This section is akin to compressing Chapter 3 of Feller Volume 1, stripping out most of the math and explaining the intuition behind the reflection principle, hitting and return probabilities, etc. There is a section titled “Sharpening our intuition” which was highly illuminating. Some of the formulas derived in Feller show up in this section, with the author showing the calculations behind hitting time and return time probabilities. The Reflection Principle and its application to the Ballot Theorem are also mentioned. As I go over these examples, I tend to remember statements made by Taleb, Rebonato, etc., who warn that the world of finance cannot be modeled using coin tossing models! At the same time, I think that if it improves my intuition about randomness, I would not care whether these principles are useful in finance or not. Well, certainly I would love to apply these things in finance, but if my intuition improves in the process of working through them, that’s good enough. Anyway, I am digressing from the intent of the post, i.e. to summarize.

Some of the problems involving random walks are solved by forming difference equations. These difference equations inevitably rely on the property that the process starts afresh after a certain event. For example, if you are looking at the gambler’s ruin problem (you start with a capital a and you win if you can get to b before you hit 0), then after a first win you face the same problem as before, except that you now have a capital of a+1; you still win if you get to b and lose if you hit 0. In such examples, where intuition yields correct results, one must also understand the details behind this “process starting afresh”. This topic is explored in a section titled “Strong Markov Principle”. To understand the Strong Markov Principle, one needs to understand stopping times. The highlight of this book, as mentioned earlier, is that things are explained intuitively with just the right amount of rigour. Stopping times are introduced in texts such as Shreve or some math finance primers using heavy math. Books such as this one should be read by any student who wants the intuition behind stopping times. Williams approaches the topic by asking a simple question: “For what kind of random time will the process start afresh?” The answer is that a random time T can be called a stopping time if, for every n, whether T<=n has occurred can be determined from the variables X1,X2,…Xn. Generally a stopping time is the first time that something observable happens. With this definition, one can look at what sort of events make T a stopping time. One must carefully note that T is a stopping time relative to a particular process X. The book subsequently states the “Strong Markov Principle”, which can then be used to justify that the process starts afresh.

The chapter ends with a note describing various methods to generate IID random variables. Various random number generators are explored, like multiplicative, mixed multiplicative and congruential generators. Frankly, I am not really motivated to understand the math behind these generators, as most of them are already coded in statistical packages. Whether a generator uses some complicated number theory or some trick is not something I want to spend time on. So I quickly moved to the end of this section, where Acceptance Sampling is described. Inverting a CDF to get a random variate from a distribution is not always possible, as the inverse of a CDF might not have a closed form. It is precisely to solve this type of problem that the accept-reject sampling procedure is tremendously useful. A reader who is coming across the method for the first time typically wonders why it works. Ideally one must get the intuition right before heading into the formal derivation of the procedure. The intuition as I understand it from the book is as follows:
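For the fair-coin gambler’s ruin, the “start afresh” argument gives the difference equation p(a) = ½ p(a+1) + ½ p(a-1) with p(0) = 0 and p(b) = 1, whose solution is p(a) = a/b. The simulation below is a rough check of that answer, written by me and not taken from the book; the capital values and trial count are arbitrary. The time at which the walk first hits 0 or b is a stopping time, since at each step you can tell from the path so far whether it has happened.

```python
import random

def reaches_b_first(a, b, rng):
    """Run a fair simple random walk from a until it first hits 0 or b."""
    x = a
    while 0 < x < b:
        x += 1 if rng.random() < 0.5 else -1
    return x == b

def win_probability(a, b, trials=20_000, seed=0):
    rng = random.Random(seed)
    return sum(reaches_b_first(a, b, rng) for _ in range(trials)) / trials

# Simulated frequency next to the difference-equation answer a / b.
print(win_probability(3, 10), 3 / 10)
```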

  • It is hard to simulate samples directly from a complicated pdf, let’s say f(x).
  • Choose a pdf g(x) that is easy to sample from, together with a scaling constant, in such a way that the scaled version of g lies on or above f everywhere (so g must be positive wherever f is).
  • Use a criterion to accept the sample value from g(x) or reject it. The logic here is: let’s say the ratio f(z) / scaled g(z) is 0.6 for z1 and 0.1 for z2; then z1 should be accepted more often than z2. This can be made concrete by simulating a uniform random number on (0,1) and accepting the simulated value z when that number is at most f(z) / scaled g(z), rejecting it otherwise.

The key is to understand the criterion, which takes the form of an inequality. Obviously there are cases where accept-reject methods fail. In such situations, Gibbs sampling or other MCMC methods are used. There is a lot of non-linearity in going about understanding statistics. Let’s say you are kicked about MCMC and want to simulate chains: you have to learn a completely new software and syntax (WinBUGS). If you want to use R to further analyze the output, you have to use the BRugs package. I just hope my non-linear paths become helpful in the overall understanding and don’t end up being unnecessary deviations from learning stuff.
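The three steps above can be sketched in a few lines. This is a toy illustration of my own, not the book's example: the target f is the Beta(2,2) density 6x(1-x), the proposal g is Uniform(0,1), and the scaling constant C = 1.5 is chosen so that C·g sits on or above f everywhere. All names here are assumptions made for the sketch.

```python
import random

def f(x):
    # Target density: Beta(2, 2), i.e. 6x(1-x) on (0, 1). Chosen only for illustration.
    return 6.0 * x * (1.0 - x)

def g(x):
    # Proposal density: Uniform(0, 1), trivial to sample from.
    return 1.0

C = 1.5  # f(x) <= C * g(x) for all x, since f peaks at f(0.5) = 1.5

def accept_reject(n, rng):
    samples = []
    while len(samples) < n:
        z = rng.random()                 # draw a candidate from g
        u = rng.random()                 # independent Uniform(0, 1) number
        if u <= f(z) / (C * g(z)):       # accept with probability f(z) / (C g(z))
            samples.append(z)
    return samples

xs = accept_reject(20_000, random.Random(42))
print(sum(xs) / len(xs))   # should be near 0.5, the mean of Beta(2, 2)
```

On average only 1/C of the candidates are accepted, so a proposal that hugs the target closely wastes fewer draws.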

Chapter 5: Generating Functions and Central Limit Theorem

Sums of random variables are something that one comes across in umpteen cases. So it is imperative to know how to find the distribution (pmf or pdf) of a sum of random variables. The chapter talks about the probability generating function (PGF), moment generating function (MGF), characteristic function (CF), Laplace transform (LT) and cumulant generating function (C).

The probability generating function is akin to a signature of a random variable, where one function summarizes the probabilities of the various realizations. The key theorem that follows from these functions is: if the generating functions of a sequence of random variables converge to the generating function of the limit random variable, then the probabilities also converge.

A pattern is followed in this chapter. For each of the 5 functions (probability generating function PGF, moment generating function MGF, characteristic function CF, Laplace transform LT, cumulant generating function C), the following are proved:

  • Convergence: if the PGF/MGF/CF/LT/C of a sequence of variables X_n converges to the PGF/MGF/CF/LT/C of X, then the distribution function of X_n converges to the distribution function of X
  • Independence: for independent RVs, the PGF/MGF/CF/LT of the sum is the product of the individual ones (and the cumulant generating functions, being logarithms, add)
  • Uniqueness: if the PGF/MGF/CF/LT/C of two variables match, then the distributions of the two variables match

The CLT, the workhorse of statistics, is stated and proved using the cumulant generating function approach. An integer correction form of the CLT is stated for discrete random variables.
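The independence property is the one that makes sums tractable. As a small sketch of my own (not from the book), a PGF of a variable on 0, 1, 2, … can be stored as its list of coefficients, and multiplying two such polynomials gives the pmf of the sum of the two independent variables. The two-dice example and the function name are just illustrative choices.

```python
from fractions import Fraction

def pgf_product(p, q):
    """Multiply two PGFs given as coefficient lists, where p[k] = P(X = k)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b          # coefficient of s^(i+j)
    return out

die = [Fraction(0)] + [Fraction(1, 6)] * 6   # P(X = k) for k = 0..6
two_dice = pgf_product(die, die)             # PGF coefficients of the sum of two dice
print(two_dice[7])   # 1/6
```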

Chapter 6: Confidence Intervals for one parameter models

The chapter starts off with a basic introduction to frequentist view of point estimation. Using the Pre-Statistics, one can form the confidence intervals and then after the experiment is performed, the confidence intervals are computed. The point to note is that one makes a statement on the confidence interval and does not say anything about the parameter as such. In Williams’ words

A frequentist will say that if he were to repeat the experiment lots of times, each performance being independent of the others and announce after each performance that, for that performance, parameter belongs to a specific confidence interval, then he would be right about C% of the time and would use THAT as his interpretation of the statement that on the one occasion that the experiment is performed with result yobs, the interval as a function of yobs is a C% Confidence Interval.
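The “right about C% of the time” reading can be checked by simulation. In this sketch of my own, the model is normal with a known standard deviation, and the values of mu, sigma, n and the number of repetitions are all arbitrary assumptions; the function counts how often the 95% interval computed from each repetition contains the true mean.

```python
import random
import statistics

def coverage(mu=10.0, sigma=2.0, n=25, trials=4000, seed=1):
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)      # about 1.96
    half_width = z * sigma / n ** 0.5               # known-sigma interval half width
    hits = 0
    for _ in range(trials):
        ybar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        hits += (ybar - half_width <= mu <= ybar + half_width)
    return hits / trials

print(coverage())   # fraction of repetitions whose interval covers mu
```

The parameter never moves in this experiment; it is the interval that is random before the data arrive, which is exactly the Pre-Statistic point.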

The chapter then introduces the concept of the likelihood function. The immediate relevance of this function is shown by the introduction of sufficient statistics. Sufficient statistics are very important in the sense that a sufficient statistic is a lower-dimensional version of the random sample that loses nothing about the parameter. How do you check whether a statistic is sufficient? One can use the factorization criterion for sufficiency mentioned in the book. One should be able to split the likelihood function into a product of two functions, one that does not involve the parameter and another that depends on the data only through the statistic. If one is able to accomplish this, the statistic is sufficient. An extremely good way to highlight the importance of sufficient statistics is given in the book, which is worth reiterating:

Let t be a sufficient statistic and let T be the PreStatistic which crystallizes to t. Let fT be the conditional density of T|theta when the true value of the parameter is theta. Then Zeus (father of gods) and Tyche can produce IID RVs Y1,Y2,…Yn each with pdf f(.|theta) as follows

  • Stage 1: Zeus picks T according to the pdf fT. He reports the chosen value of T but NOT the value of theta , to Tyche
  • Stage 2: Having been told the value of T, but not knowing the value of theta, Tyche performs an experiment which produces the required Y1,Y2,…Yn

Thus sufficient statistic tells everything that is needed to know about the sample.

An estimator is basically a function of the realized sample. This is what I vaguely remember having learnt. However, the section makes it very clear that an estimator is a Pre-Statistic and hence is a function of n random variables. One can study its pdf, pmf and everything related to it using probability. Once the experiment is performed, the estimator leads to an estimate. One of the basic criteria for estimators is unbiasedness, meaning that the expectation of the estimator given the parameter theta is theta. However, there can be many unbiased estimators that satisfy this definition. What is the method to choose amongst the unbiased ones? Here is where the concept of efficiency comes in. The variance of any unbiased estimator is bounded below by the inverse of the Fisher information. If you have a bunch of estimators and one of them attains this minimum bound, then you can stop your search and take that estimator as the MVB unbiased estimator. There is a beautiful connection between Fisher information and the variance of an estimator that is evident in the Cramer-Rao Minimum Variance Bound. For an estimator that attains the bound, the inverse of the Fisher information is exactly its variance. This section explains the difference between an unbiased estimator and an MVB unbiased estimator for a parameter.

MLEs are introduced not for their own sake but to throw light on confidence intervals. Firstly, the MLE is an estimator and, like any estimator, it is a Pre-Statistic. The way the MLE is connected to confidence intervals is through the fact that, for large samples, the MLE is approximately MVB unbiased and approximately normal. This enables one to form confidence intervals for the parameter by pretending that it has a normal posterior density whose variance is the inverse of the Fisher information. The section then moves on to prove the consistency of the MLE, whereby the MLE converges to the true value of the parameter as the sample size increases. To appreciate the connection between the approximate normality of the MLE and forming confidence bands, the Cauchy distribution is used. The Cauchy distribution has no mean or variance, and hence the usual methods based on the sample mean are useless even for large n. An MLE-based confidence interval can still be formed for a crazy distribution like the Cauchy.
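Written out, the bound and a standard case where it is attained look as follows. This is a sketch in my own notation, assuming the usual regularity conditions and an unbiased estimator T of theta based on IID observations with density f(y | theta); the book's symbols may differ.

```latex
\operatorname{Var}_\theta(T) \;\ge\; \frac{1}{I_n(\theta)},
\qquad
I_n(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^{n}\log f(Y_i\mid\theta)\right)^{2}\right].
% Example: Y_i IID N(\theta, \sigma^2) with \sigma known
I_n(\theta) = \frac{n}{\sigma^{2}},
\qquad
\operatorname{Var}_\theta(\bar{Y}) = \frac{\sigma^{2}}{n} = \frac{1}{I_n(\theta)}.
```

So for normal data with known variance the sample mean attains the bound and is the MVB unbiased estimator of the mean.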

To prove the consistency of the MLE, the concepts of entropy and relative entropy are introduced. Relative entropy gives a measure of how badly a pdf f is approximated by a pdf g. This is also called Kullback-Leibler relative entropy. The connection between entropy and the normal distribution is shown, and subsequently the importance of the normal distribution in the Central Limit Theorem is highlighted.

The chapter then goes on to cover Bayesian confidence intervals. The basic formula of Bayesian statistics is that the posterior is proportional to prior times likelihood. How does one choose priors? What are the different kinds of priors?
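For discrete distributions the relative entropy is a one-line sum. The sketch below is my own and uses the common convention sum of f_i log(f_i / g_i); the book may order the arguments differently, so treat the direction as an assumption. The two pmfs are made-up examples.

```python
import math

def relative_entropy(f, g):
    """Kullback-Leibler relative entropy sum f_i * log(f_i / g_i) for discrete pmfs."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(relative_entropy(fair, fair))     # 0.0
print(relative_entropy(fair, biased))   # strictly positive
```

It is zero only when the two distributions agree and positive otherwise, which is the property that drives the consistency argument for the MLE.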

  • Improper priors : These priors cannot be normalized to a proper pdf
  • Vague Priors
  • Reference priors : Priors chosen based on Fisher information
  • Conjugate priors : The prior is chosen in such a way that prior times likelihood gives a posterior in the same family as the prior, with updated parameters
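The textbook conjugate pair is a Beta prior with binomial data: the posterior is again a Beta, with the observed successes and failures added to the prior parameters. The snippet is a minimal sketch of that update; the function name and the numbers (a uniform Beta(1,1) prior, 7 successes, 3 failures) are my own illustration.

```python
def beta_binomial_update(a, b, successes, failures):
    """Beta(a, b) prior + binomial data -> Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

a_post, b_post = beta_binomial_update(1, 1, successes=7, failures=3)
posterior_mean = a_post / (a_post + b_post)   # mean of Beta(8, 4) is 8/12
print(a_post, b_post, posterior_mean)
```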

If one were to make any sensible inferences about parameters, then

  • Either one must have reasonably good prior information, which one builds into one’s prior
  • or if one has only vague prior information, one must have an experiment with sample size sufficient to make one’s conclusions rather robust to changes in one’s prior

Whenever one chooses a prior, there is always a possibility that the prior and the likelihood conflict, in which case prior times likelihood is tiny everywhere and the posterior density is something close to 0/0. This is the main reason for the lack of robustness. The author gives an overview of hierarchical modeling. He also mentions the famous Cromwell’s dictum and considers an application where a mixture of conjugates is taken and imaginary results are used to select a decent prior. Cromwell’s dictum basically says that one’s prior density should never be literally zero on a theoretically possible range of parameter values. One must regard it as a warning to anyone who chooses a prior density that is too localized and is therefore extremely small on a range of parameter values which we cannot exclude. So, overall, the section on Bayesian confidence intervals compresses the main results of Bayesian statistics into 20-odd pages. It distills almost every main feature of Bayesian stats.

In a book of ~500 pages, hypothesis testing takes up ~20 pages, and that too under the title “Hypothesis testing – if you must”. This is mainly to remind readers that hypothesis testing is a tool that has become irrelevant and that the way to report estimates is through confidence intervals.

CIs usually provide a much better way of analyzing data

With this strong advice against using hypothesis testing, the author does give all the main aspects of hypothesis testing. The story as such is straightforward to understand. The MOST important test in the frequentist world is the Likelihood Ratio test. In the frequentist world, the null and alternative hypotheses are not placed on an equal footing. The null is always “innocent unless proven guilty”. The idea behind the LR test is this: divide the parameter space into two, as per the hypotheses. Compute the maximum likelihood of the data over the alternative parameter space and the maximum likelihood of the data over the null parameter space, and take the ratio of the two. Reject the null if the ratio is high. A sensible approach. But this strategy can reject the null when in fact the null is true. So new characters enter the picture. First is the power function, which gives the probability that the data land in the rejection region for a given parameter value. Ideally an experimenter would want the power function to be 0 on the null and 1 on the alternative parameter space. Since this is too much to ask for, he asks for a value close to 0 on the null parameter space and close to 1 on the alternative parameter space. Choosing these values for the power function is a way to control Type I and Type II errors. Having done this, the experimenter computes the sample size of the experiment and the critical value of the LR test. There is a nice connection to the chi-squared distribution that helps us quickly get through all this. Deviance is defined to be twice the logarithm of the likelihood ratio, and for large samples this turns out to be approximately chi-squared distributed under the null. Thus one can directly use the deviance to accept or reject the null. Having given all these relevant details, the author reiterates the point that it is better to cite confidence intervals or HDIs than to do hypothesis testing. In that sense, Bayesian stats is definitely appealing. But the problem with Bayes is priors. Where do we get them? The chapter ends with a brief discussion of model selection criteria.
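Here is what the deviance recipe looks like in the simplest case I could think of. It is a sketch of my own, not an example from the book: a binomial experiment with k successes in n trials, the null hypothesis p = p0 against an unrestricted p, and the familiar 3.841 cut-off for a chi-squared variable with 1 degree of freedom at the 5% level. The data values are invented.

```python
import math

def binomial_deviance(k, n, p0):
    """Twice the log likelihood ratio for H0: p = p0 against an unrestricted p.

    Assumes 0 < k < n so that the logarithms are finite.
    """
    p_hat = k / n                        # MLE over the unrestricted parameter space
    def loglik(p):
        return k * math.log(p) + (n - k) * math.log(1 - p)
    return 2.0 * (loglik(p_hat) - loglik(p0))

CHI2_1_95 = 3.841   # 95% point of chi-squared with 1 degree of freedom
dev = binomial_deviance(60, 100, 0.5)
print(round(dev, 3), dev > CHI2_1_95)   # deviance and whether the null is rejected
```

With 60 heads in 100 tosses the deviance lands just above the cut-off, so the fair-coin null is rejected at the 5% level, though only barely.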

Chapter 7: Conditional Pdfs and Multi-Parameter Bayes

The chapter starts off by defining joint pdfs, conditional pdfs, joint pmfs and conditional pmfs. It then shows a geometric derivation of the Jacobian. Transformation of random variables is then introduced so that one can calculate the joint density of the transformed variables. One point which is often missed in intro-level probability books is that when we talk about a pdf, it is “a pdf” and not “the pdf”, because one can define many pdfs which agree at all points except on a set of measure 0 (a subtle point to understand). A novice might have to go through some measure theory to appreciate this. The section then derives the Fisher (F) and t distributions using transformations of appropriate random variables. Simulation of values from gamma and t distributions is discussed in detail. The following points summarize the interconnections


Frankly I speed read this part, because, well, all it takes to generate a gamma rv is the rgamma function, and a t rv is rt, both from R. Somehow when I first came across pseudo random numbers I was all kicked to know the gory details about them, but as time passed, I have developed an aversion to going into too much depth about these random number generators. As long as there is a standard library that can do the job, I don’t care how it’s done. The chapter then introduces conditional pdfs and marginal pdfs to prepare the ground for multiparameter Bayesian statistics.
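For the curious, the transformation route to a t variable is short enough to sketch. This is my own illustration, not the book's algorithm. It uses the standard fact that Z / sqrt(V / df) is t distributed when Z is standard normal and V is an independent chi-squared(df) variable, and that chi-squared(df) is a gamma with shape df/2 and scale 2. The function names are hypothetical.

```python
import math
import random

def sample_t(df, rng):
    """Draw one Student t variate with df degrees of freedom by transformation."""
    z = rng.gauss(0.0, 1.0)
    # chi-squared(df) is Gamma(shape = df / 2, scale = 2)
    v = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(v / df)

def sample_many_t(df, n, seed=0):
    """Draw n t variates from a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_t(df, rng) for _ in range(n)]
```

With df = 10 the sample mean should sit near 0 and the sample variance near df / (df - 2) = 1.25, which is a quick sanity check on the construction.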

If one needs to estimate mu and sigma for the data, one can form priors for each of these parameters, invoke the likelihood function and then write down the posterior pdf. However the expression is rather complicated. Conditional on mu, the expression is tractable, and conditional on sigma it is tractable as well.

The idea of the Gibbs sampler, as for all MCMC methods, is that being given a sample with an empirical distribution close to a desired distribution is nearly as good as being given the distribution itself.

The last section of this chapter is about Multi-Parameter Bayesian Statistics. The content is extremely concise and I found it challenging to go over some of it, especially the part towards the end that covers the Bayesian-Frequentist conflict. In a typical posterior density computation, the biggest challenge is the constant that needs to be tagged along with the prior times the likelihood function. This constant of proportionality is very hard to evaluate, even numerically. This is where Gibbs sampling comes in. It allows one to get to the posterior density by simulation. In a posterior pdf that usually looks very complicated, if you condition on all the parameters except one, you get a tractable conditional density. So, the logic behind Gibbs sampling is: you condition on all the parameters except the first one and draw a value for the first parameter, then condition on all the parameters except the second one, using the values drawn in the previous steps, and you keep going until the chain settles down to the posterior distribution. A nice feature of Gibbs sampling is that you can just ignore nuisance parameters in the MCMC chain and focus on whatever parameters you are interested in. The author then introduces WinBUGS to infer the mean and variance of a sample from a normal distribution. Full working code is given in the text for a reader to actually type the commands into WinBUGS. Of course these examples merely scratch the surface. There are umpteen things to learn to model with WinBUGS and there are obviously entire books written that flesh out the details.
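The book does this in WinBUGS. Purely as an illustration of the mechanics, here is a minimal Gibbs sampler sketch of my own for the same problem: normal data with unknown mean mu and precision tau. The priors are my assumption and are not taken from the book: mu ~ N(m0, 1/p0) and tau ~ Gamma(a0, rate b0), which make both full conditionals standard distributions.

```python
import math
import random

def gibbs_normal(data, n_iter=5000, burn_in=1000, seed=1,
                 m0=0.0, p0=1e-3, a0=1e-3, b0=1e-3):
    """Gibbs sampler for (mu, sigma^2) of normal data under the assumed priors."""
    rng = random.Random(seed)
    n = len(data)
    total = sum(data)
    mu, tau = total / n, 1.0  # starting values
    draws = []
    for it in range(n_iter):
        # mu | tau, data is normal with precision p0 + n * tau
        precision = p0 + n * tau
        mean = (p0 * m0 + tau * total) / precision
        mu = rng.gauss(mean, 1.0 / math.sqrt(precision))
        # tau | mu, data is gamma with shape a0 + n / 2 and rate b0 + SS / 2
        rate = b0 + 0.5 * sum((x - mu) ** 2 for x in data)
        tau = rng.gammavariate(a0 + n / 2.0, 1.0 / rate)  # second argument is a scale
        if it >= burn_in:
            draws.append((mu, 1.0 / tau))  # keep (mu, sigma^2) after burn-in
    return draws
```

Averaging the first component of the draws gives a posterior mean for mu, and simply ignoring the second component is what "dropping a nuisance parameter" amounts to in practice.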

The author then quickly mentions the limitations of MCMC:

  • All MCMC techniques are simulations, not exact answers
  • Convergence to the posterior distribution can be very slow
  • The chain can converge to the wrong distribution
  • The chain can get stuck in a local region of the parameter space
  • The results might not be robust to changing priors

The author also warns that one should not specify vague priors for hyperparameters as it can lead to improper posterior densities. There is also a mention of the problem of simultaneous confidence intervals. A confidence region (CR) for a vector parameter has two difficulties. Firstly, it can be rather expensive in computer time and memory to calculate such a CR. Secondly, it is usually difficult to convey the results in a comprehensible way. The section ends with a brief discussion of the Bayesian-Frequentist conflict, which I found hard to understand. Maybe I will revisit this at a later point in time.

Chapter 8: Linear Models and ANOVA

The chapter starts off by explaining the Orthonormality Principle in two dimensions. The Orthonormality Principle marks the perfect tie-up which often exists between likelihood geometry and least-squares methods. Before explaining the math behind the models, the author gives a trailer for all the important models. The trailers contain visuals that attempt to provide motivation to go over the abstract linear algebra setting taken in the chapter. These trailers concern the normal sampling theorem, linear regression and ANOVA. The second section of the chapter gives a crash course on linear algebra principles and then states the orthonormality principle for projections. This principle gives us a way to look at the geometry of a linear model. If you are given a basic linear model Y = mu + sigma G, then one way to tie the likelihood geometry and inference together is as follows:

  • Write down the likelihood function
  • The MLE for mu is the projection of Y on to the subspace U relevant to the null hypothesis
  • Compute the LR test statistic
  • Compute the deviance
  • Compute the F statistic
  • If n is large, the deviance is approximately chi^2 distributed

One very appealing feature of this book is that there is Bayesian analysis for all the models. So, one can see the WinBUGS code written by the author for linear regression and ANOVA. It gives the reader a good idea about the flexibility of Bayesian modeling. Even though the mathematics is not as elegant as in the frequentist world, the flexibility of Bayes more than compensates for it. The chapter explains the main techniques for testing goodness of fit, i.e. the qqplot, the KS test (basically a way to quantify the quantile comparison) and the likelihood based LR test. The LR test statistic can be approximated by the classic Pearson statistic. Base R allows you to draw a qqplot for Gaussian distributions. For other distributions, one can check the car package, which not only supports different distributions but also draws confidence bands for the observed quantile data. Goodness of fit coupled with parameter estimation is a dicey thing. I had never paid attention to this till date. I was always under the impression that parameter estimation can be done first and goodness of fit can subsequently be run based on the estimates. The author clearly gives the rationale for the problems that might arise from combining parameter estimation and goodness of fit testing.

There is a very interesting example that shows the sticking phenomenon: the MCMC chain sticks to certain states and never converges to the right parameter value. The example deals with data from a standardized t distribution where you have to infer the distribution of the degrees of freedom parameter. Gibbs sampling and WinBUGS do not give the right value of the degrees of freedom parameter. In fact the chain never goes to the state relating to the true parameter. This example has made me curious to look up the literature on ways to solve the "sticking problem" in MCMC methods. The chapter ends with a section on the multivariate normal (MVN) distribution that defines the pdf of the MVN, lists its various properties, derives the CI for the correlation between bivariate normal RVs and states the multivariate CLT.

Out of all the chapters in the book, I found this chapter to be the most challenging and demanding. I had to understand Fisherian statistics, likelihood modeling, WinBUGS and Bayesian hierarchical modeling so as to follow the math in this chapter. In any case I am happy that the effort put into the preparation has paid off well. I have a better understanding of Frequentist and Bayesian modeling after reading through this chapter.

Chapter 9 : Some Further Probability

This chapter comprises three sections. The first section is on conditional probability. Standard stuff is covered, like the properties of conditional expectation, applications to branching processes etc. What I liked about this section is the thorough discussion of the "Two envelopes paradox". The takeaway is to be cautious about the conditional expectation of random variables that are not in L1 space.

The second section is on martingales. I think this section can form a terrific primer for someone looking to understand martingales deeply. This can also be a prequel to working through the book "Probability with Martingales" by the same author. The section starts off by defining a martingale and giving a few examples of martingales resulting from a random walk. The Constant Risk Principle is stated as a criterion for checking whether the stopping time is finite or infinite. This is followed by Doob’s Stopping Time Principle (STP). The author states in this context that the STP is one of the most interesting subsections in the book. Indeed it is. Using the STP, many problems become computationally convenient. The highlight of this section is treating a “waiting time for a pattern" problem as a martingale. Subsequently an extension of Doob’s STP, called Doob’s optional sampling theorem, is stated. The author also states the Martingale Convergence theorem, which he considers one of the most important results in mathematics.
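The pattern problem has a neat closed form that is easy to check in code. The sketch below is my own and uses the standard fair-casino argument, which may or may not match the book's exact presentation: a new gambler arrives before every toss and bets on the pattern letter by letter at fair odds, so at the stopping time the expected number of tosses equals the total payoff held by the gamblers still in the game. Those are exactly the gamblers whose run matches a prefix of the pattern that is also a suffix. The function name and the H/T encoding are hypothetical.

```python
def expected_waiting_time(pattern, p_heads=0.5):
    """Expected number of tosses until `pattern` (a string of 'H'/'T') first appears."""
    probs = {"H": p_heads, "T": 1.0 - p_heads}
    m = len(pattern)
    total = 0.0
    for k in range(1, m + 1):
        # A gambler who joined k tosses before the end is still winning
        # only if the first k letters equal the last k letters.
        if pattern[:k] == pattern[m - k:]:
            payoff = 1.0
            for c in pattern[:k]:
                payoff /= probs[c]  # a fair bet multiplies the stake by 1 / p
            total += payoff
    return total
```

For a fair coin this gives 6 tosses for HH but only 4 for HT, the classic surprise that the two patterns are equally likely on any two tosses yet have different waiting times.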

Towards the end of the section, the author proves the Black Scholes option pricing formula using martingales. The third section is on Poisson processes. I think this section is basically meant to provide some intuition about the diverse applications of conditional probability models. The last chapter of the book is on Quantum Probability and Quantum Computing, which was too challenging for me to read through.

Takeaway :

The author starts off the book with the following statement: “Probability and Statistics used to be married; then they separated; then they got divorced; now they hardly ever see each other. The book is a move towards much needed reconciliation”. The book does a terrific job in rekindling that lost love. This is by far the best book on “probability and statistics” that I have read till date. Awesome work by the humble Professor, David Williams.


This book reminds me of “Elementary Stochastic Calculus with Finance in view”, a book by Thomas Mikosch, in terms of the overall goal. This book has the goal of making the reader understand the nuts and bolts of the Black Scholes pricing formula. Probability theory, Lebesgue integration and Ito calculus are the main ingredients in the Black Scholes formula and these rely on set theory, analysis and an axiomatic approach to mathematics. Anything in math is built from the ground up. This means that every idea/proof/lemma/axiom is pieced together in a logical manner so that the overall framework makes sense. This book introduces all the necessary ingredients in a pleasant way. There are some challenging exercises at the end of every chapter and the reader is advised to work through all of them. The author motivates the reader by saying

An hour or two attempting a problem is never a waste of time and to make sure this happened, exercises were the focus of our small-group weekly workshops

Chapter 1: Money and Markets

The first chapter gives a basic introduction to the time value of money and serves as a basic refresher to calculus.

Chapter 2: Fair Games

The irrelevance of expectation pricing in finance is wonderfully illustrated in Baxter and Rennie’s book on Financial Calculus. Why is expectation based pricing dangerous? The reason is that expectation pricing is not enforceable. There is another kind of pricing that is enforceable, and any other kind of pricing technique is dangerous. This pricing technique goes by the name “arbitrage pricing”. This chapter starts off with a basic example where two people, John and Mark, play a betting game with each other. Expectation pricing will make sense in this case if they play a lot of games with each other.

The second example is where John and Mark place bets with a bookmaker on a two horse race. The bookmaker offers odds for each of the horses. These odds can be used to cull out the implied probability of each horse winning. If the bookmaker does not quote odds based on arbitrage pricing, he risks going bankrupt. There is no place for expectation based pricing here. Based on the odds quoted, a particular horse might have two different probabilities with respect to John and Mark. Basically this means that John and Mark are operating in different probability spaces. If there is a single player betting on each horse, the bookmaker can quote odds and be done with it, assuring himself a guaranteed profit. However as the bets start accumulating, he becomes more and more prone to the risk of a huge loss. He must either change the odds or hedge the exposure. To remove uncertainty from his exposure, the bookmaker can place a bet, with another bookmaker, on the horse whose win is likely to make him bankrupt. In this way he gets a guaranteed profit or can at least think of breaking even.
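Here is a small sketch of the bookmaker's arithmetic, with numbers I made up for illustration (they are not the book's). Odds are quoted as "o to 1 against", so a winning stake s pays back s * (o + 1) and the implied probability is 1 / (o + 1). Both function names are hypothetical.

```python
def implied_probabilities(odds_against):
    """Implied win probability of each horse from 'o to 1 against' odds."""
    return [1.0 / (o + 1.0) for o in odds_against]

def bookmaker_profit(odds_against, stakes):
    """Bookmaker's profit in each scenario, one entry per winning horse."""
    total = sum(stakes)  # all stakes are collected up front
    # If horse i wins, its backers get their stake back plus winnings.
    return [total - s * (o + 1.0) for o, s in zip(odds_against, stakes)]
```

With odds of [0.5, 1.0] the implied probabilities are 2/3 and 1/2, which sum to more than 1, and a balanced book of stakes [200, 150] locks in a profit of 50 whichever horse wins. Shift the stakes to [300, 50] and the profit becomes [-100, 250]: the bookmaker is now exposed to the first horse, which is exactly when he changes the odds or hedges.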

If you think from a bookmaker’s perspective, all the activities he does like quoting the odds, hedging, changing the odds are the same activities of a derivative contract seller. In fact this chapter is a superb introduction to the concept of derivative pricing under equivalent martingale measure. I had loved this introduction in Baxter and Rennie’s and was thrilled to see the same kind of intro in this book. In fact I think any book on derivative pricing should have "Banish expectation pricing – Embrace arbitrage pricing" slogan at the very beginning.

Chapter 3: Set Theory

The chapter begins by motivating the reader to go through theorems, proofs, lemmas and corollaries, as they are the organizing principles of any field. By learning these abstract ideas, one can apply the learning to various situations, which is a much better way than learning case specific results. In the author’s words

The axiomatic approach does contain a degree of uncertainty not experienced when the classical approach of learning by rote and reproducing mathematical responses in carefully controlled situations is followed, but this uncertainty gradually diminishes as progress is achieved. The cause of this uncertainty is easy to identify. Over time one becomes confident that understanding has been acquired, but then a new example or result or failure to solve a problem or even to understand a solution may destroy this confidence and everything becomes confused. A re-examination is required, and with clarification, a better understanding is achieved and confidence returns. This cycle of effort, understanding, confidence and confusion keeps repeating itself, but with each cycle progress is achieved. To initiate this new way of thinking usually requires a change of attitude towards mathematics and this takes time.

Having given this motivation, the author talks about the mysterious “infinity” that is at the heart of mathematical abstraction. The reader is taken through countability, least upper bound / greatest lower bound concepts etc. After this selective journey through the real numbers, the chapter talks about sigma algebras. Maybe some other books carried a visual about filtration. I don’t recollect it now. In any case, the best thing about this chapter is the visual that it provides for a discrete filtration.

In a typical undergrad probability course one works with situations where the entire outcome space is visible right away, so one does not need the concept of sigma algebras. In finance though, there is a time element for a random stock price or any random quantity. You do not know the entire outcome space. Information gets added as you move from one day to the next. So, in one sense, one needs to be comfortable with sigma algebras that are subsets of the master sigma algebra, a term that I am using just to make things easier to write. Not all subsets of the outcome space select themselves as events at an appropriate time. Definitions of sigma algebra and measurable space are given. Subsequently, the concept of a “generating set for the sigma field” is explored. The generating sets are usually small in size. A nice example is given where one is asked to describe the sigma field generated by a collection of subsets. Soon enough you realize that the exercise becomes tedious if you try to include unions and intersections manually. The mess is too painful to wade through. An alternative solution using partitions is presented in the chapter that makes the exercise of finding the "sigma field generated by a collection of subsets" workable. The key idea is to relate partitions to equivalence classes and then use these equivalence classes to quickly generate the sigma algebra. In Shreve, I came across the phrase “Sets are resolved by the sigma algebra”. In this chapter though, the sentence is aptly summarized by many visuals. Getting a good idea about filtrations in discrete space will mightily help in transitioning to the continuous time space.
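The partition trick is easy to mechanize on a finite outcome space. The following is my own sketch (the function names are hypothetical): two outcomes are put in the same equivalence class when they belong to exactly the same sets of the generating collection, and the generated sigma field is then every possible union of those classes.

```python
from itertools import combinations

def atoms(omega, collection):
    """Partition omega: outcomes are equivalent when they lie in the same generating sets."""
    classes = {}
    for w in omega:
        signature = tuple(w in s for s in collection)  # membership pattern of w
        classes.setdefault(signature, set()).add(w)
    return [frozenset(c) for c in classes.values()]

def generated_sigma_field(omega, collection):
    """All unions of atoms, which on a finite omega is the generated sigma field."""
    parts = atoms(omega, collection)
    events = set()
    for r in range(len(parts) + 1):
        for combo in combinations(parts, r):
            events.add(frozenset().union(*combo))  # r = 0 gives the empty set
    return events
```

On omega = {1, 2, 3, 4}, the collection [{1, 2}] gives two atoms and a sigma field with 4 events, while [{1, 2}, {2, 3}] splits omega into four singletons and gives all 16 subsets. With k atoms there are always 2^k events, which is why doing unions and intersections by hand gets painful so quickly.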

Chapter 4 : Random Variables

Random variables are basically functions that map from outcome space to R. Ideally one can just be in (Omega, F) and do all the computations necessary. However to take advantage of the rich structure of R, it should be connected with F. This is typically done using the concept of measurability. So, in a way one is talking about moving to a different space for computational ease. To move from one space to another, the structure has to be preserved. This structure goes by the name sigma field. The mappings are called measurable functions.

  • For measurable spaces (Omega, F), you have measurable functions to move to R
  • For measure spaces (Omega, F, P), you have random variables to move to R

Whenever one talks about measurable functions, there are some core concepts that one needs to grasp. Firstly, the inverse image of every Borel set under the function should lie in F. Only then does it make sense. An intuitive way of thinking about measurable functions is that they are carriers of information. Once the basic criterion for a measurable function is satisfied, one needs to give it some recognition. So, we give a name to the collection of inverse images of all the Borel sets in R: it is the “sigma algebra generated by the random variable” X, denoted by F_X. Typically F_X is a sub sigma algebra of F. In order to check whether a function is a measurable function on a measurable space, one needs test candidates. The Borel sigma algebra is huge and hence testing each set is going to take ages. A convenient way is to select a collection of sets, A, that generates the Borel sigma algebra, B. So, instead of checking all the sets in the Borel sigma algebra, one can merely check whether the inverse images of the sets in A are in F. The total information available from an experiment is embedded in the sigma field of events F of the sample space Omega.

The book puts it this way :

The real valued function X on Omega may be regarded as a carrier of information in the same way as the satellite relays message from one source to another. The receiver will hopefully extract from the Borel set B information. An important requirement when transmitting information is accuracy. If X is not measurable, then inverse mapping is not an observable event and information. For this reason we require X to be measurable. If X is measurable, then the information transferred will be about events in F_X. Complete information will be transferred if F_X = F, and at times this may be desirable. On the other hand, F may be extremely large and we may only be interested in extracting certain information. In such cases, if X secures the required information, we may consider X as a file storing the information about all the events in F_X. In the case of a filtration, we obtain a filing system.

Chapter 5 : Probability Spaces

If you take a measurable space and then attach a measure, it becomes a measure space. It can be called a probability space if the measure of the entire outcome space is 1 and the measure satisfies countable additivity. With an experiment you can associate a probability space. Let’s say you have two variables with their probability spaces. If you want to combine the two probability spaces and create a more generic structure, you can go about it this way: first, define a combined outcome space, then define the product sigma algebra and then extend the probability measure to events in the product sigma algebra using the probability measures of the individual spaces.

Random variables are introduced in this chapter. These are measurable functions that are defined with a restriction on the type of measure. The random variables are mappings from the outcome space to the real line. These random variables in turn generate a sigma algebra, and thus there are two probability spaces that one can think of in the context of a random variable: one is the original space and the other is the induced probability space that is defined on the real line as outcome space and has the Borel sigma algebra as event space. Two variables can be dependent or independent depending on the probability measure applicable to the measurable space: the same two random variables on the same measurable space can be independent under one probability measure and dependent under another. The chapter gives conditions that are necessary for the independence of two random variables. Obviously if the sigma fields generated by the random variables are independent, then the variables are independent. But if the sigma fields generated by the random variables are not independent, then the independence is decided by the probability measure attached to the common event space.

Chapter 6 : Expected Values

Expected values, lengths and areas are measurements which share a common mathematical background. Basically if you want to measure something, you try to divide it into arbitrary lengths and then approximate the lengths by a suitable number. During the final decade of the 19th century, mathematicians in Paris began investigating which sets in Euclidean space R^n were capable of being measured. Emile Borel made the fundamental observation that all figures used in science could be obtained from simple figures such as line segments, squares and cubes by forming countable unions and taking complements. He suggested the term sigma field for a collection of sets large enough to cover most of the stuff that we come across. He defined the measure of a set as a limit of the measurements obtained by taking countable unions. He did not succeed, as he could not show that the resulting measure of a set was independent of the way it was built up from simple sets.

Henri Lebesgue used Borel’s ideas on countability and complements but proceeded in a different way. He defined measurability differently and in the process led to the introduction of “Lebesgue measure”, “Lebesgue measurable set” and “Lebesgue measurable spaces”.

There are 4 types of mathematical objects described in this chapter, i.e. simple random variables, positive bounded random variables, positive measurable variables and integrable random variables. Firstly, simple random variables are just step functions for which expectations are easy to compute. When you talk about expectation, one can use either E or integral sign to denote it. Survival of fittest did not seem to have happened in the notation for expectation. Using E for expectation is better for denoting conditional expectation, Martingales etc. However integral sign is good for showing expectation for disjoint subsets etc. It is important to pay attention to this fact of notation.

The second level of sophistication is defining a positive bounded random variable. Any positive bounded random variable has a canonical representation that involves simple random variables. All the properties of positive bounded random variables can be explored using the canonical representation.

The third level of sophistication is defining a positive random variable. Any positive random variable can be represented as the limit of a sequence of positive bounded variables. Using this fact, all the properties of positive random variables are stated and proved.

The fourth level of sophistication is defining an integrable random variable. Any integrable random variable can be represented by positive random variables.

The pattern that is used across the four mathematical objects is the following :

  • Form two increasing sequences of a specific type of variable (simple random variables, positive bounded random variables, positive measurable variables and integrable random variables)
  • Assume they are pointwise convergent
  • Prove that the expectations of both sequences converge

The reason for introducing these 4 types of mathematical objects is this: use the expectation of simpler random variables to build the expectation of a more sophisticated random variable. The chapter ends by proving two important theorems, the Monotone convergence theorem and the Dominated convergence theorem. These theorems are used in most of the proofs.
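To see the "build up from simple random variables" idea in numbers, here is a sketch of my own. It uses the standard dyadic staircase X_n = floor(2^n X) / 2^n, which may differ from the book's canonical representation, and it assumes a specific example: X(w) = w^2 on [0, 1] with Lebesgue measure, for which P(a <= X < b) = sqrt(b) - sqrt(a) and E[X] = 1/3.

```python
import math

def simple_expectation(n):
    """E[X_n] for the staircase X_n = floor(2^n X) / 2^n with X(w) = w^2 on [0, 1]."""
    levels = 2 ** n
    total = 0.0
    for k in range(levels):
        lo, hi = k / levels, (k + 1) / levels
        # X_n takes the value lo on the event {lo <= X < hi}
        total += lo * (math.sqrt(hi) - math.sqrt(lo))
    return total
```

Each X_n is a simple random variable, the sequence increases pointwise to X, and the expectations climb towards 1/3 from below as n grows, which is the Monotone convergence theorem at work.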

Chapter 7 : Continuity and Integrability

The following are the highlights of this chapter :

  • There is a connection between integrability and convergence of series of real numbers. The fact that every absolutely convergent series is convergent can be proved using a random variable defined on a probability triple.
  • Independence property between two random variables can be characterized by the expected values. If you take the product of two random variables and take the expectation, the expectation operator splits the product in to individual expectations.
  • Measures defined and introduced.
  • Riemann integral is defined for continuous functions and criterion for Riemann integration is given. Given this context, Lebesgue integral is introduced as an integral of a random variable with respect to a probability measure.
  • Law of Unconscious Statistician stated
  • Convergence Pointwise implies Convergence almost surely which in turn implies Convergence in Distribution
  • Chebyshev’s Inequality
  • Convex functions and Jensen’s inequality

Chapter 8 : Conditional Expectation

Instead of diving into the topic of conditional expectation right away, the chapter introduces a 2 step binomial tree and shows the realizations of a conditional expectation variable for various sample paths. Principles such as a discrete filtration and being adapted to a filtration are discussed in the context of the two step tree. If the conditioning sigma algebra is generated by a countable partition then the conditional expectation variable can be interpreted pointwise. However in other general cases, the conditional expectation needs to be inferred from a few conditions. There is no specific formula for conditional expectation. In this context, the Radon Nikodym theorem is stated without proof. Certain basic properties of conditional expectation are also stated and proved.
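The partition case can be sketched in a few lines. This is my own illustration with made-up numbers, not the book's example: on each block of the partition, the conditional expectation is just the probability-weighted average of the variable over that block. The function name is hypothetical.

```python
def conditional_expectation(values, probs, partition):
    """E[X | sigma(partition)] as a dict over outcomes, constant on each block."""
    result = {}
    for block in partition:
        p_block = sum(probs[w] for w in block)
        avg = sum(probs[w] * values[w] for w in block) / p_block
        for w in block:
            result[w] = avg  # same value for every outcome in the block
    return result
```

Take a two step tree with S0 = 100, an up factor of 2, a down factor of 0.5 and equally likely paths, so S2 is 400, 100, 100, 25 on UU, UD, DU, DD. Conditioning on the first step, i.e. the partition {UU, UD}, {DU, DD}, gives 250 on the up branch and 62.5 on the down branch, and averaging those two values returns E[S2] = 156.25, as the tower property says it should.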

Chapter 9: Martingales

The chapter starts off with a basic definition of a discrete martingale adapted to a filtration on a probability space. Four examples are provided which give the reader enough knowledge to apply the principles in option pricing. The thing I liked about this chapter is the martingale convergence section. If you have a martingale and it is bounded in L1 (the space of integrable random variables), then it converges almost surely. This property of martingale convergence is extremely useful in various problems. The line of attack for producing a distribution function for a random variable is to figure out a martingale that has this variable in it, apply the martingale convergence theorem, find the limiting value of the martingale and, from that expression, extract the distribution of the random variable. The chapter also defines continuous martingales and mentions the Girsanov theorem in its elementary form. Chapter 10 talks about the Black Scholes option pricing formula. The last chapter is a good rigorous introduction to stochastic integration.

This book painstakingly builds up all the relevant concepts in probability from scratch. The only pre-requisite for this text is “persistence”, as there are a ton of theorems and lemmas throughout the book. The pleasant thing about the book is that there are good visuals explaining various concepts. One may forget a proof but a visual sticks for a long time. So, in that sense this book scores over many other books.


Solving an SDE analytically can be done only in a few instances (toy SDEs). For the majority of cases, one solves it numerically. Having said that, this book can be read by anyone who is interested in understanding SDEs better. Simulation is a great way to understand many aspects of stochastic processes. For example, you can read through the Girsanov theorem for change of measure, but by visualizing it through a few sample paths, you get a deeper understanding. I have managed to go over only the chapters that deal with simulation and my summary would obviously comprise only those chapters, i.e. the first two chapters. I have postponed reading Chapter 3, which goes into inference, and Chapter 4, which comprises a set of advanced topics. Maybe I will find time to go over them in the future. For now, let me mention a few points from the book.

The first chapter is a crash course on stochastic processes and SDEs. It zips through basic probability concepts and then introduces change of measure. To give a practical application of change of measure, the chapter mentions the preferential sampling technique and gives an example to illustrate its utility. All the relevant terms that one comes across in the context of stochastic processes are defined, such as filtrations, measurability with respect to a filtration, quadratic variation etc. As expected, the chapter provides R code that simulates a Brownian motion, Geometric Brownian Motion, Brownian Bridge etc.
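The book's code is in R. As a language-neutral illustration, here is a rough Python sketch of my own for the same three processes. The function names, default parameters and the random-walk construction are my choices and not the book's code: Brownian motion is built from independent normal increments, GBM uses its exact solution on the same path, and the bridge pins the endpoint back to zero.

```python
import math
import random

def brownian_path(n_steps, T=1.0, seed=0):
    """Brownian motion on [0, T] sampled on an even grid of n_steps intervals."""
    rng = random.Random(seed)
    dt = T / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))  # N(0, dt) increment
    return w

def gbm_from_brownian(w, x0=1.0, mu=0.05, sigma=0.2, T=1.0):
    """Geometric Brownian motion via X_t = x0 exp((mu - sigma^2 / 2) t + sigma W_t)."""
    dt = T / (len(w) - 1)
    return [x0 * math.exp((mu - 0.5 * sigma ** 2) * i * dt + sigma * w[i])
            for i in range(len(w))]

def brownian_bridge(w, T=1.0):
    """Bridge from 0 to 0 on [0, T]: B_t = W_t - (t / T) W_T."""
    n_steps = len(w) - 1
    return [w[i] - (i / n_steps) * w[-1] for i in range(len(w))]
```

Calling brownian_path(100) and feeding the result to the other two functions gives three paths driven by the same noise, which makes it easy to plot them side by side.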

In fact the good thing about this book is that it comes with the CRAN package “sde” that has functions to simulate different stochastic processes. So, once you understand the way it has been coded, you can always make use of the functions provided in the package, instead of coding everything from scratch. The Ito integral is defined via the limit of Ito integrals of simple bounded functions. This approximation procedure is explained very well in Oksendal’s book. However if one is not inclined to go over the math and wants to see this visually, one can always check out the code in this book, where a sequence of Ito integrals of non anticipating functions converges in mean square to the Ito integral of a general integrand. All said and done, visual understanding is good but not enough. The effort of going through the math is well worth it as you will know exactly why the whole thing works. The chapter also lists the important properties of the Ito integral.

Diffusion processes are introduced and conditions for existence and uniqueness of solutions are provided. These conditions are merely stated here and if one wants to understand the reasons behind those conditions, I will mention Oksendal’s book again, which does a fantastic job of explaining the nuts and bolts. The thing I liked about this chapter is the visual illustration of Girsanov’s theorem. This is a theorem that helps change the drift of a Brownian motion. The code present in the book shows the changed likelihood of paths. I have tweaked it a bit to illustrate two cases: a change of measure that turns 1) a drifting Brownian motion into a driftless Brownian motion and 2) a driftless Brownian motion into a drifting Brownian motion. Here is a sample of 30 sample paths and the respective path probabilities under the change of measure.


Darker paths carry higher weights. The visuals on the left hand side show that to turn a drifting BM into a driftless BM, one has to put less weight on the paths that are drifting. The visuals on the right show that, to turn a driftless BM into a drifting BM, one has to put more weight on the paths that tend to drift. This is exactly what the Radon Nikodym derivative does.
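The reweighting can also be checked numerically. The sketch below is mine, not the book's code, and it assumes a constant drift theta: for X_t = theta t + W_t the Radon Nikodym derivative is Z = exp(-theta W_T - theta^2 T / 2), which depends on a path only through its endpoint, so simulating terminal values is enough. The function name is hypothetical.

```python
import math
import random

def girsanov_weighted_mean(theta, T=1.0, n_paths=20000, seed=0):
    """Plain and Radon-Nikodym weighted means of X_T for a BM with drift theta."""
    rng = random.Random(seed)
    plain_sum = 0.0
    weighted_sum = 0.0
    for _ in range(n_paths):
        w_T = rng.gauss(0.0, math.sqrt(T))
        x_T = theta * T + w_T  # endpoint of the drifting BM under the original measure
        z = math.exp(-theta * w_T - 0.5 * theta ** 2 * T)  # weight dQ/dP for this path
        plain_sum += x_T
        weighted_sum += z * x_T
    return plain_sum / n_paths, weighted_sum / n_paths
```

With theta = 0.5 the plain average of the endpoints sits near 0.5 while the weighted average sits near 0: the paths that drifted upwards get small weights, so under the new measure the process looks driftless.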

The chapter ends by stating the following list of models/SDEs and computing their moments, conditional density, conditional expectation and conditional variance (wherever possible):

  • Ornstein-Uhlenbeck or Vasicek process
  • Geometric Brownian motion model
  • Cox-Ingersoll-Ross model
  • Chan-Karolyi-Longstaff-Sanders (CKLS) family of models
  • Hyperbolic processes
  • Nonlinear mean reversion: Aït-Sahalia model
  • Double-well potential model
  • Jacobi diffusion process
  • Ahn and Gao model
  • Radial Ornstein-Uhlenbeck process
  • Pearson diffusions
  • The stochastic cusp catastrophe model
  • Generalized inverse Gaussian diffusions

The second chapter starts with the two most popular techniques for solving SDEs numerically: the Euler approximation and the Milstein scheme. The Lamperti transform is introduced to show that the Euler approximation applied to the Lamperti transform of an SDE is equivalent to the Milstein scheme on the transformed process, because the transformed SDE has a unit diffusion coefficient and the Milstein correction term vanishes. The transform is illustrated by applying it to the GBM and CIR processes, which results in simplified SDEs. The workhorse of the chapter, as well as of the book, is the function sde.sim().
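To make the difference between the two schemes concrete, here is a minimal Python sketch of my own (it does not use or mimic sde.sim()). It applies Euler and Milstein to GBM, dX = mu * X dt + sigma * X dW, drives both with the same Brownian increments, and measures the mean absolute terminal error against the exact solution. Parameter values and names are illustrative assumptions.

```python
import math
import random

def gbm_strong_errors(mu=0.05, sigma=0.5, x0=1.0, t_end=1.0,
                      n_steps=50, n_paths=2000, seed=3):
    """Mean absolute terminal error of Euler and Milstein for GBM,
    measured against the exact solution on the same Brownian path."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    err_euler = err_milstein = 0.0
    for _ in range(n_paths):
        x_e = x_m = x0
        w = 0.0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            # Euler: drift and diffusion frozen at the left endpoint
            x_e += mu * x_e * dt + sigma * x_e * dw
            # Milstein: Euler plus 0.5 * sigma(x) * sigma'(x) * (dW^2 - dt)
            x_m += (mu * x_m * dt + sigma * x_m * dw
                    + 0.5 * sigma ** 2 * x_m * (dw * dw - dt))
            w += dw
        exact = x0 * math.exp((mu - 0.5 * sigma ** 2) * t_end + sigma * w)
        err_euler += abs(x_e - exact)
        err_milstein += abs(x_m - exact)
    return err_euler / n_paths, err_milstein / n_paths
```

For GBM the Lamperti transform is Y = log(X) / sigma, which satisfies dY = (mu / sigma - sigma / 2) dt + dW. Since the diffusion coefficient of Y is constant, the extra Milstein term is zero and the Euler step on Y is already the Milstein step on Y.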

The following can be accomplished using this function:

  • Processes that can be simulated: OU process, GBM, Cox-Ingersoll-Ross process, Vasicek process
  • Simulation method: Euler, KPS, Milstein, Milstein2, conditional density, Exact Algorithm, Ozaki, and Shoji
  • Law: One can simulate from the conditional law or the stationary law for the above 4 processes
  • Exact Algorithm: A very powerful algorithm that uses a biased Brownian motion and change of measure principles to simulate a solution of the SDE. The algorithm uses the hitting time of a Poisson process for simulating the SDE.

The chapter also gives a detailed example that compares the performance of the Milstein and Euler approximation methods. Even though simulating from the transition density is possible in only a few cases, the chapter suggests that wherever possible it should be preferred over other simulation methods. There is a discussion of local linearization methods that relax some assumptions of the Euler/Milstein schemes, i.e. the drift and diffusion coefficients are not assumed constant within each subinterval of the time partition.
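The Ornstein-Uhlenbeck process is one of the few cases where the transition density is known, so it makes a good illustration of simulating from it directly. The Python sketch below is my own and assumes the parametrization dX = theta * (mu - X) dt + sigma dW, which may differ from the one used in the “sde” package. Each step draws from the Gaussian transition law, so there is no discretization error no matter how large the time step is.

```python
import math
import random

def ou_exact_path(x0, theta, mu, sigma, dt, n_steps, rng):
    """Simulate dX = theta * (mu - X) dt + sigma dW from its transition density.

    Given X_t = x, the value X_{t+dt} is Gaussian with
      mean     mu + (x - mu) * exp(-theta * dt)
      variance sigma^2 * (1 - exp(-2 * theta * dt)) / (2 * theta).
    """
    decay = math.exp(-theta * dt)
    step_sd = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * theta))
    path = [x0]
    for _ in range(n_steps):
        mean = mu + (path[-1] - mu) * decay
        path.append(rng.gauss(mean, step_sd))
    return path

def terminal_moments(n_paths=5000, seed=11, **kwargs):
    """Sample mean and variance of the terminal value over many paths."""
    rng = random.Random(seed)
    ends = [ou_exact_path(rng=rng, **kwargs)[-1] for _ in range(n_paths)]
    m = sum(ends) / n_paths
    v = sum((e - m) ** 2 for e in ends) / (n_paths - 1)
    return m, v
```

With a long horizon, for example terminal_moments(x0=3.0, theta=2.0, mu=0.5, sigma=1.0, dt=1.0, n_steps=5), the sample mean and variance should be close to the stationary values mu and sigma^2 / (2 * theta), even though the time step of 1.0 would be far too coarse for an Euler scheme.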

In the “sde” package, there are about 45 functions that one can use for SDE simulation and inference. I have tried categorizing the functions below (based on my limited understanding):