Right from the preface of the book, Prof. David Williams emphasizes that intuition is more important than rigour. The definition of probability in terms of long-term frequency is fatally flawed, and hence the author makes it very clear in the preface that “probability works only if we do not define probability in the way we talk about probability in the real world”. In other words, colloquial references to probability give rise to shaky foundations. However, if you build up probability theory axiomatically, then the whole subject is as rigorous as group theory. Statistics in the modern era is vastly different from that of yesteryear. Computers have revolutionized the application of statistics to real-life problems. Most of the modern problems are solved by applying Bayes’ formula via MCMC packages. If this statement is surprising to you, then you should definitely read this book. Gone are the days when statisticians used to refer to tables of distributions, p-values, etc. to talk about their analysis. In fact, in the entire book of 500-odd pages, there are only about 15 pages of content on hypothesis testing, and that too with the title “Hypothesis testing, if you must”. Today one of the critical ingredients of a statistician’s toolbox is MCMC. OK, let me attempt to summarize this book.
Chapter 1: Introduction
The author states that the purpose of the book is to provide enough of a link between Probability and Statistics to enable the reader to go on to advanced books on Statistics, and also to show that “Probability in its own right is both fascinating and very important”. The first chapter starts off by stating the two most common errors made in studying the subject. The first common mistake, when it comes to actually doing calculations, is to assume independence when it does not hold. The other common error is to assume that things are equally likely when they are not. The chapter comprises a set of examples and exercises that show the paradoxical nature of probability and the mistakes we can make if we are not careful in understanding the subject properly. The Two Envelopes Paradox illustrates the point that a measure-theoretic understanding is critical to understanding the subject. The idea of probability as a long-term relative frequency is debunked with a few examples that motivate the need for understanding measure theory.
This chapter is a charm, as it encourages readers to embrace measure theory results in a manner that helps intuition. It says that by taking a few results from measure theory for granted, we can clearly understand the subject and its theorems. The author says that “Experience shows that mathematical models based on our axiomatic system can be very useful models for everyday things in the real world. Building a model for a particular real-world phenomenon requires careful consideration of what independence properties we should insist that our model have”.
One of the highlights of the book is that it explains the linkage between probability and statistics in an extremely clear way. I have not come across such a lucid explanation of this linkage anywhere before. Let me paraphrase the author here,
In Probability, we do in effect consider an experiment before it is performed. Numbers to be observed or calculated from observations are at that stage Random Variables to be observed or calculated from observations – what I shall call Pre-Statistics, rather nebulous mathematical things. We deduce the probability of various outcomes of the experiment in terms of certain basic parameters.
In Statistics, we have to infer things about the values of the parameters from the observed outcomes of an experiment already performed. The Pre-Statistics have now been crystallized into actual statistics, the observed number or numbers calculated from them.
We can decide whether or not operations on actual statistics are sensible only by considering probabilities associated with the Pre-Statistics from which they crystallize. This is the fundamental connection between Probability and Statistics.
The chapter ends with the author reiterating the importance of computers in solving statistical problems. Closed-form analytical solutions are not always needed, and sometimes they are not even possible, so one has to resort to simulation techniques. Maybe 20 years back it was OK to somehow do statistics without much exposure to statistical software, but not any more. Computers have played a fundamental role in the spread and understanding of statistics. Look at the rise of R and you come to the inevitable conclusion that “statistical software skills are becoming as important as modeling skills”.
Some of the quotes by the author that I found interesting are
The best way to learn how to program is to read programs, not books on programming languages.
In regard to Statistics, common sense and scientific insight remain, as they have always been, more important than Mathematics. But this must never be made an excuse for not using the right Mathematics when it is available.
Intuition is more important than rigor, though we do need to know how to back up intuition with rigor.
Our intuition finds it very hard to cope with sometimes perverse behavior of ratios
The likelihood geometry of the normal distribution, the most elegant part of Statistics, has been used to great effect, and remains an essential part of Statistical culture. What was a good guide in the past does not lose that property because of advances in computing. But, modern computational techniques mean that we need no longer make the assumption of normally distributed errors to get answers. The new flexibility is exhilarating, but needs to be used wisely
Chapter 2: Events and Probabilities
The second chapter gives a whirlwind tour of measure theory. After stating the basic axioms of probability and consequences such as the addition rule and the inclusion-exclusion principle, the chapter goes on to define all the important terms and theorems of measure theory. This chapter is not a replacement for a thorough understanding of measure theory. However, if one does not want to plough through measure theory AT ALL, but wants to have just enough of an idea to get going, then this chapter is useful. For readers who already know measure theory, this chapter serves as a good recap of the parts of the subject that are necessary to understand statistics. The concepts introduced and stated in this chapter are Monotone Convergence Properties, Almost Surely, Null Sets, Null Events, the Borel-Cantelli Lemma, Borel Sets and Functions, and the Pi-System Lemma. A good knowledge of these terms is priceless in understanding stats properly. The Banach-Tarski Paradox is also stated, so that one gets a good idea of the reason for leaving non-measurable sets out when constructing the event space.
Chapter 3: Random Variables, Means and Variances
The tone of the book is sometimes funny. The author at the beginning of the chapter says
This chapter will teach you nothing about probability and statistics. All the methods described for calculating means and variances are dull and boring, and there are better, more efficient indirect methods described in later chapters of the book.
Random variables are basically measurable functions, and the reason for choosing Lebesgue/Borel measurable functions is that it is hard to escape from the measurable world: sums, differences, products and limits of measurable functions are again measurable. The author introduces a specific terminology in the book that helps the reader understand the linkage between Statistics and Probability. A Pre-Statistic is a special kind of random variable: a random variable Y is a Pre-Statistic if the value Y(w_actual) will be known to the observer after the experiment is performed. Y(w_actual) then becomes the observed value of Y. An example mentioned in the book brings out the difference:
Let’s say you want to know the width of an object, and you have 100 different observations of the width. You can calculate the sample mean and the residual sum of squares. But to interpret these crystallized numbers, one needs to study the Pre-Statistics M and RSS and know their distributions, so that one can compute confidence intervals. So the Pre-Statistics become important and are a part of Probability. The inference about the parameters forms a part of Statistics.
The chapter then talks about distribution functions, probability mass functions and probability density functions. Expectation of random variables is introduced via simple random variables and non-negative random variables. I really liked the discussion of the supremum that connects the expectation of a general non-negative random variable to that of simple non-negative random variables. Having gone through the discussion, I am able to verbalize the ideas in a better way. Wherever measure theory is used, it is flagged with the letter M so that an uninterested reader can skip the content. However, it is very important for any student of statistics to know the connection between integration and expectation. The Lebesgue integral of a non-negative measurable function is defined as the supremum of the integrals of the simple functions lying below it. This is where the connection to integration appears.
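The supremum construction can be written out compactly. The notation below (Z for a simple random variable taking values z_1, ..., z_m) is my own shorthand and not necessarily the book's:

```latex
% Expectation of a simple random variable Z taking finitely many values z_1, ..., z_m >= 0
\[
  E(Z) = \sum_{k=1}^{m} z_k \, P(Z = z_k)
\]
% Expectation of a general non-negative random variable X, as a supremum over
% the simple random variables lying below it
\[
  E(X) = \sup \{\, E(Z) : Z \text{ simple},\ 0 \le Z \le X \,\}
\]
```

Read this way, E(X) is exactly the Lebesgue integral of X with respect to the probability measure P.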
The chapter quickly introduces the L1 and L2 spaces and shows how various random variables fit in. The mean of a random variable can be computed if the random variable lies in L1, whereas variance makes sense for a variable in L2. On a probability space L2 is a subset of L1, so a variable with a finite variance automatically has a mean. This and related facts become clear after having some knowledge of functional analysis, normed vector spaces, metric spaces, etc. So my feeling is that a reader who has never been exposed to spaces such as Hilbert spaces would fail to appreciate the various aspects mentioned in this chapter. There is no alternative but to slog through the math from some other book to understand these concepts. However, this chapter serves as a wonderful recap of function spaces. The fact that a random variable can be standardized only if it lies in L2 needs to be understood one way or the other, be it through slog or through the right kind of intuition.
Chapter 4: Conditioning and Independence
The chapter starts with these words
Hallelujah! At last, things can liven up.
The author is clearly kicked about the subject and more so about conditioning. The author’s enthusiasm might be infectious too. At this point in the book, any reader would start looking forward to the wonderful world of conditioning.
The chapter introduces the basic definition of conditional probability, states Bayes’ theorem, and gives the famous Polya’s urn example to illustrate the use of conditioning. Polya’s urn is described as “a thing of beauty and joy forever”, meaning that the Polya’s urn model is very rich mathematically. Indeed, there are books on probability models where all the models have Polya’s urn scheme as the spine. Look at the book Polya Urn Models, where innumerable models are discussed with Polya’s urn in the background. Simple examples of Bayesian estimation are followed up with a rather involved example of Bayesian change point detection.
I think the section on genetics and the application of conditional probability has been the biggest takeaway from this chapter and probably from the entire book. I am kind of ashamed to write here that I had not read any books by Richard Dawkins, one of the greatest living science writers. Sometimes I am amazed at my ignorance of various things in life, and then I am reminded of the innumerable activities that I wasted my time on in the past. Well, at least I am glad that I now recognize the value of time in a better way. Anyway, the author strongly urges the reader to go over at least a few books by Dr Richard Dawkins. Thanks to the torrent sites, all the books by Richard Dawkins are available for free. I have downloaded a couple and have put them on my reading list. At the least I should read “The Selfish Gene” and “The Blind Watchmaker” soon. Apart from the fantastic theory, there is tremendous scope to see the application of conditional probability to the field of genetics. I hope to understand these aspects as I keep working on honing my probability skills.
OK, coming back to the summary, the chapter then introduces the concept of independence and the Borel-Cantelli Lemma, and shows through an example the connection between lim sup and the Borel-Cantelli Lemma. There are some exercises in the chapter that are challenging. The “Five Nation problem” is something that I found interesting. It goes like this: if 5 nations play a sport against each other, i.e. 10 games in total, and each game is equally likely to go either way, what is the probability that each nation wins exactly 2 games? The answer is 3/128 and it is not immediately obvious. One needs to think a bit to get to the solution. I worked it out using a simulation and then solved it analytically.
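The 3/128 figure can be checked by brute force, since there are only 2^10 = 1024 possible sets of results. The sketch below is my own check, not the book's solution, and it assumes every game is an independent fair coin toss with no draws:

```python
from fractions import Fraction
from itertools import combinations, product

def five_nations_probability(teams=5, wins_each=2):
    """Exact probability that every team wins `wins_each` games in a round robin.

    Assumption: each game is an independent fair coin toss between the
    two sides, and draws are not possible.
    """
    games = list(combinations(range(teams), 2))   # the 10 pairings for 5 teams
    favourable = 0
    for outcome in product((0, 1), repeat=len(games)):
        wins = [0] * teams
        for (a, b), first_wins in zip(games, outcome):
            wins[a if first_wins else b] += 1     # credit the winner of this game
        if all(w == wins_each for w in wins):
            favourable += 1
    return favourable, Fraction(favourable, 2 ** len(games))

count, prob = five_nations_probability()
print(count, prob)
```

The enumeration finds 24 favourable outcomes out of 1024, which reduces to 3/128.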
I knew a way to calculate the pdf of order statistics using a rather tedious approach: crank out the cdf of the order statistic of IID variables, differentiate, and show that for uniform variables the order statistic follows a beta distribution. Williams explores the same statistic and derives the result intuitively but rigorously, without resorting to calculus. I had persevered to understand the calculus approach earlier, and hence this intuitive approach was a pleasant surprise that I will forever remember. The distribution of the median of a uniform random sample is calculated, and it is immediately followed up with an example. This is the recurrent pattern of the book. Whenever you see the distribution of some statistic being derived, in this case the median of n IID uniform RVs, it is followed up with a use of that statistic. As one is aware, the Cauchy distribution has no mean. If you merely observe N Cauchy variables, their sample mean will not help you estimate the location parameter. It is the median that comes to your rescue: the sample median is likely to be close to the location parameter of the Cauchy distribution.
The section on the Law of Large Numbers has an interesting example called the Car Convoy problem, which shows that a simple probability model can have a sophisticated mathematical underpinning. This example is a great application of the Borel-Cantelli Lemma, and I hope to remember it forever. Basic inequalities like Tchebyshev’s and Markov’s are introduced, the former being used in the derivation of the weak law of large numbers.
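A quick simulation makes the Cauchy point tangible. This is a minimal sketch under my own assumptions: the location value 3.0, the sample sizes and the seed are all arbitrary, and the samples are drawn by inverting the Cauchy CDF:

```python
import math
import random
import statistics

def cauchy_sample(theta, n, rng):
    """Draw n Cauchy(theta, 1) values by inverting the CDF."""
    return [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

rng = random.Random(0)
theta = 3.0                      # hypothetical location parameter
for n in (101, 1001, 10001):
    xs = cauchy_sample(theta, n, rng)
    # The sample mean keeps jumping around; the sample median settles near theta.
    print(n, round(statistics.mean(xs), 3), round(statistics.median(xs), 3))
```

The mean column does not stabilise as n grows, because the sample mean of Cauchy variables has the same Cauchy distribution as a single observation, while the median column tightens around the location.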
Every experiment, no matter how complicated, can be reduced to choosing a number from the uniform distribution on (0,1). This fact is shown intuitively using Borel’s Strong Law for coin tossing. Kolmogorov’s Strong Law of Large Numbers is stated and proved in an intuitive manner with less frightening measure theory. In the end it is amply clear that the SLLN deals with almost sure convergence and the WLLN deals with convergence in probability. This is where things tie in. In an infinite coin tossing experiment, it is the SLLN that is invoked to show that the proportion of heads almost surely converges to ½; there are many sequences for which the proportion does not converge to ½, but together they have measure 0. Convergence in probability is weaker: it says only that, for each large n, the average is close to the mean with probability close to 1, which by itself does not force the sequence of averages to converge along any particular realization. The author then states the “Fundamental Theorem of Statistics” (Glivenko-Cantelli), which talks about the convergence of the sample distribution function to the population distribution function. Basically this is the reason why bootstrapping works: you compute the ecdf and work with it for all sorts of estimation and inference. The section also mentions a few examples which show that the weak law is indeed weak, meaning that the long-term average can still differ significantly from the mean at some times, even though it converges in probability to the mean.
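The Glivenko-Cantelli statement is easy to watch numerically. The sketch below is an illustration of my own: it uses Uniform(0,1) data so that the population distribution function is simply F(x) = x, with arbitrary sample sizes and seed:

```python
import random

def sup_ecdf_gap(sample):
    """Sup distance between the ECDF of `sample` and the Uniform(0,1) CDF F(x) = x."""
    xs = sorted(sample)
    n = len(xs)
    # The ECDF jumps from i/n to (i+1)/n at xs[i], so the supremum is attained
    # just before or exactly at one of the jumps.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(1)
gaps = {n: sup_ecdf_gap([rng.random() for _ in range(n)]) for n in (100, 10_000)}
print(gaps)
```

The largest gap between the ecdf and F shrinks as the sample grows, which is what justifies using the ecdf as a stand-in for the unknown distribution in bootstrapping.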
The chapter then introduces Simple Random Walks. This section is akin to compressing Chapter 3 of Feller Volume 1, stripping away the math and explaining the intuition behind the reflection principle, hitting and return probabilities, etc. There is a section titled “Sharpening our intuition” which was highly illuminating. Some of the formulas derived in Feller show up in this section, with the author showing the calculations behind hitting time and return time probabilities. The Reflection Principle and its application to the Ballot Theorem are also mentioned. As I go over these examples, I tend to remember statements made by Taleb, Rebonato, etc., who warn that the world of finance cannot be modeled using coin tossing models! At the same time, I think that if it improves my intuition about randomness, I would not care whether these principles are useful in finance or not. Well, certainly I would love to apply these things in finance, but if my intuition improves in the process of working through them, that’s good enough. Anyway, I am digressing from the intent of the post, i.e. to summarize.
Some of the problems involving random walks are solved by forming difference equations. These difference equations inevitably rely on the property that the process starts afresh after a certain event. For example, consider the gambler’s ruin problem (you start with a capital a, and you win if you get to b before you hit 0): after winning the first bet, you face the same problem as before, except that you now have a capital of a+1, and you win if you get to b or lose if you hit 0. In such examples, where intuition yields correct results, one must also understand the details behind this “process starting afresh”. This topic is explored in a section titled “Strong Markov Principle”. To understand the Strong Markov Principle, one needs to understand the notion of a stopping time. The highlight of this book, as mentioned earlier, is that things are explained intuitively with just the right amount of rigour. Stopping times are introduced in texts such as Shreve or some math finance primers using heavy math. Books such as this one need to be read by any student so that he or she gets the intuition behind stopping times. Williams approaches the problem by asking a simple question: “For what kind of random time will the process start afresh?” The answer is that a random time T is called a stopping time if, for every n, whether or not T<=n can be determined from the variables X1,X2,…,Xn. Generally, a stopping time is the first time that something observable happens. With this definition, one can work out which kinds of events make T a stopping time. One must carefully understand that T is a stopping time relative to a particular process X. The book subsequently states the “Strong Markov Principle”, which can then be used to justify that the process starts afresh.
The chapter ends with a note describing various methods to generate IID sequences. Various random number generators are explored, such as multiplicative, mixed multiplicative and congruential generators. Frankly, I am not really motivated to understand the math behind these generators, as most of them are already coded in statistical packages. Whether a generator uses some complicated number theory property or some trick is not something I want to spend time on. So I quickly moved to the end of this section, where accept-reject sampling is described. Inverting a CDF to get a random variate from a distribution is not always possible, as the inverse of the CDF might not have a closed form. It is precisely to solve this type of problem that the accept-reject sampling procedure is tremendously useful. A reader who is coming across the method for the first time typically wonders why it works. Ideally one must get the intuition right before heading into the formal derivation of the procedure. The intuition as I understand it from the book is as follows:
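For the fair-coin gambler's ruin, the "start afresh" argument gives the difference equation h(k) = (h(k-1) + h(k+1)) / 2 with h(0) = 0 and h(b) = 1, whose solution is h(k) = k/b. Here is a small sketch that checks the recurrence exactly and compares it with a simulation. The values a = 3, b = 10, the fair coin and the seed are my assumptions for illustration:

```python
import random
from fractions import Fraction

def win_probability(a, b):
    """P(reach b before 0 | start at a) for a fair +/-1 walk, i.e. a / b."""
    return Fraction(a, b)

def simulate(a, b, trials, rng):
    """Monte Carlo estimate of the same probability."""
    wins = 0
    for _ in range(trials):
        capital = a
        while 0 < capital < b:
            capital += 1 if rng.random() < 0.5 else -1
        wins += capital == b
    return wins / trials

a, b = 3, 10   # hypothetical starting capital and target
h = [win_probability(k, b) for k in range(b + 1)]
# "Starting afresh" after the first bet is exactly this recurrence.
assert all(h[k] == (h[k - 1] + h[k + 1]) / 2 for k in range(1, b))
estimate = simulate(a, b, 20_000, random.Random(2))
print(float(h[a]), estimate)
```

The simulated frequency should sit close to the exact value 0.3. The Strong Markov Principle is what makes the recurrence legitimate, since the first bet ends at a stopping time.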
It is hard to simulate samples from a complicated pdf, let’s say f(x)
Choose a pdf g(x) that is analytically tractable and easy to sample from, together with a scale factor, such that the scaled version of g lies on or above f everywhere on the support of f
Use a criterion to accept or reject each value sampled from g(x). The logic here is: let’s say the ratio f(z) / scaled g(z) is 0.6 for z1 and 0.1 for z2; then one must prefer z1 to z2. This can be made concrete by simulating a uniform random number on (0,1), comparing it with f(z) / scaled g(z), and accepting the simulated value z only if the random number does not exceed the ratio.
The key is to understand the acceptance criterion, which takes the form of an inequality. Obviously there are cases where accept-reject methods fail. In such situations, Gibbs sampling or other MCMC methods are used. There is a lot of non-linearity in going about understanding statistics. Let’s say you are kicked about MCMC and want to simulate chains: you have to learn a completely new software and syntax (WinBUGS). If you want to use R to further analyze the output, you have to use the BRugs package. I just hope my non-linear paths become helpful in the overall understanding and don’t end up being unnecessary deviations from learning stuff.
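The three steps can be sketched in a few lines. The choices below are mine, purely for illustration: the target f is a Beta(2, 2) density, the proposal g is Uniform(0, 1), and the scale factor is 1.5, the maximum of f:

```python
import random

def target_pdf(x):
    """f(x) = 6 x (1 - x) on [0, 1] (a Beta(2, 2) density), standing in for the 'hard' pdf."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

def proposal_pdf(x):
    """g(x) = 1 on [0, 1], the easy pdf we can sample from directly."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

M = 1.5  # scale factor chosen so that f(x) <= M * g(x) everywhere (the max of f is 1.5)

def accept_reject(n, rng):
    samples = []
    while len(samples) < n:
        z = rng.random()                      # candidate drawn from g
        u = rng.random()                      # independent Uniform(0, 1) number
        if u <= target_pdf(z) / (M * proposal_pdf(z)):
            samples.append(z)                 # accepted with probability f(z) / (M g(z))
    return samples

draws = accept_reject(20_000, random.Random(3))
print(sum(draws) / len(draws))   # should be near the Beta(2, 2) mean of 0.5
```

On average a fraction 1/M of the candidates is accepted, so a tighter envelope (a smaller valid M) wastes fewer draws.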
Chapter 5: Generating Functions and Central Limit Theorem
Sums of random variables are something that one comes across in umpteen cases. So it is imperative to know how to find the distribution (pmf or pdf) of a sum of random variables. The chapter talks about the probability generating function (PGF), moment generating function (MGF), characteristic function (CF), Laplace transform (LT) and cumulant function (C).
A probability generating function is akin to a signature of a random variable, where one function summarizes the probabilities of the various realizations. The key theorem that follows from these functions is: if the generating functions of a sequence of random variables converge to the generating function of the limit random variable, then the probabilities also converge.
A pattern is followed in this chapter. For each of the 5 functions (probability generating function PGF, moment generating function MGF, characteristic function CF, Laplace transformation LT, Cumulant function), the following are proved
Convergence of the PGF/MGF/CF/LT/C of a sequence of variables X_n to the PGF/MGF/CF/LT/C of X means that the distribution functions of the X_n converge to the distribution function of X
For independent RVs, the PGF/MGF/CF/LT of the sum is the product of the PGF/MGF/CF/LT of the individual RVs (for the cumulant function, which is the logarithm of the MGF, the functions add instead)
Uniqueness : If PGF/MGF/CF/LT/C for two variables match then the distributions for the two variables match
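The product rule for independent variables is easy to see with PGFs stored as polynomial coefficient lists. The two-dice example below is my own, not taken from the book:

```python
from fractions import Fraction

def pgf_product(p, q):
    """Multiply two PGFs stored as coefficient lists (index k holds P(X = k))."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

# PGF of one fair die: (s + s^2 + ... + s^6) / 6
die = [Fraction(0)] + [Fraction(1, 6)] * 6
two_dice = pgf_product(die, die)          # PGF of the sum of two independent dice
print(two_dice[7])                        # coefficient of s^7, i.e. P(sum = 7)
```

The coefficient of s^7 comes out as 1/6, the familiar probability of rolling a total of seven, and the uniqueness property is what lets us read the pmf of the sum straight off the product.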
CLT, the workhorse of statistics is stated and proved using cumulant generating function approach. Integer correction form of CLT is stated for discrete random variables.
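As I understand it, the integer correction amounts to shifting the cut-off by half a unit when a discrete sum is approximated by a normal distribution. The sketch below compares the exact binomial probability with the plain and corrected normal approximations. The values n = 100, p = 0.5 and k = 55 are arbitrary choices of mine:

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n, p, k = 100, 0.5, 55            # hypothetical example values
mean, sd = n * p, math.sqrt(n * p * (1 - p))
exact = binomial_cdf(k, n, p)
plain = normal_cdf((k - mean) / sd)             # CLT without the correction
corrected = normal_cdf((k + 0.5 - mean) / sd)   # CLT with the half-integer shift
print(exact, plain, corrected)
```

In this example the corrected approximation lands much closer to the exact probability than the uncorrected one.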
Chapter 6: Confidence Intervals for one parameter models
The chapter starts off with a basic introduction to the frequentist view of point estimation. Using the Pre-Statistics, one can form confidence intervals, and then after the experiment is performed, the confidence intervals are computed. The point to note is that the probability statement is about the confidence interval, which is random before the experiment, and not about the parameter as such. In Williams’ words
A frequentist will say that if he were to repeat the experiment lots of times, each performance being independent of the others and announce after each performance that, for that performance, parameter belongs to a specific confidence interval, then he would be right about C% of the time and would use THAT as his interpretation of the statement that on the one occasion that the experiment is performed with result yobs, the interval as a function of yobs is a C% Confidence Interval.
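The repeated-experiment interpretation can be simulated directly. This is a sketch under assumptions of my own: normal data with known sigma, a 95% interval, and arbitrary values for mu, sigma, n and the number of repetitions:

```python
import math
import random

def coverage(mu, sigma, n, reps, rng, z=1.96):
    """Fraction of repeated experiments whose 95% interval contains mu."""
    half_width = z * sigma / math.sqrt(n)       # sigma is assumed known
    hits = 0
    for _ in range(reps):
        ybar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        hits += (ybar - half_width) <= mu <= (ybar + half_width)
    return hits / reps

rate = coverage(mu=10.0, sigma=2.0, n=25, reps=5_000, rng=random.Random(4))
print(rate)
```

The frequentist is "right about C% of the time" in exactly this sense: the proportion of intervals that cover the fixed parameter is close to 0.95, while any single computed interval either covers it or does not.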
The chapter then introduces the concept of the likelihood function. The immediate relevance of this function is shown by the introduction of sufficient statistics. Sufficient statistics are very important in the sense that a sufficient statistic is a lower-dimensional summary of the random sample. How do you check whether a statistic is sufficient? One can use the factorization criterion for sufficiency mentioned in the book. One should be able to split the likelihood function into two factors: one that does not involve the parameter, and another that depends on the data only through the statistic. If one is able to accomplish this, the statistic is sufficient. An extremely good way to highlight the importance of sufficient statistics is given in the book, which is worth reiterating:
Let t be a sufficient statistic and let T be the PreStatistic which crystallizes to t. Let fT be the conditional density of T|theta when the true value of the parameter is theta. Then Zeus (father of gods) and Tyche can produce IID RVs Y1,Y2,…Yn each with pdf f(.|theta) as follows
Stage 1: Zeus picks T according to the pdf fT. He reports the chosen value of T but NOT the value of theta , to Tyche
Stage 2: Having been told the value of T, but not knowing the value of theta, Tyche performs an experiment which produces the required Y1,Y2,…Yn
Thus a sufficient statistic tells everything that the sample has to say about the parameter.
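The factorization criterion can be made concrete with a small worked example. The notation (g, h, t) and the Bernoulli example are my own choices and may differ from the book's:

```latex
% Factorization criterion: t is sufficient for theta when the likelihood splits as
\[
  f(y_1, \dots, y_n \mid \theta) = g\bigl(t(y), \theta\bigr)\, h(y)
\]
% Example: IID Bernoulli(theta) observations y_i in {0, 1}, with t(y) = y_1 + ... + y_n
\[
  f(y \mid \theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}
                   = \underbrace{\theta^{t} (1 - \theta)^{n - t}}_{g(t,\, \theta)}
                     \cdot \underbrace{1}_{h(y)}
\]
```

So the number of successes is sufficient. Once the total t is known, the order of the individual 0s and 1s can be generated without knowing theta, which is what Tyche does in Stage 2.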
I vaguely remember having learnt that an estimator is basically a function of the realized sample. However, the section makes it very clear that an estimator is a Pre-Statistic and hence a function of n random variables. One can study its pdf, pmf and everything related to it using probability. Once the experiment is performed, the estimator leads to an estimate. One of the basic criteria for estimators is that they should be unbiased, meaning that the expectation of the estimator given parameter theta is theta. However, there can be many unbiased estimators that satisfy the above definition. What is the method to choose amongst the unbiased ones? Here is where the concept of efficiency comes in. The variance of any unbiased estimator is bounded below by the inverse of the Fisher information. If you have a bunch of estimators and one of them attains this minimum bound, then you can stop your search and take that estimator as the MVB unbiased estimator. There is a beautiful connection between Fisher information and the variance of an estimator that is evident in the Cramér-Rao Minimum Variance Bound: the inverse of the Fisher information is the smallest variance an unbiased estimator can have. This section explains the difference between an Unbiased Estimator and an MVB Unbiased Estimator for a parameter.
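In symbols, the bound and a case where it is attained look like this. The notation is my own; the example assumes normal data with a known variance, and the usual regularity conditions are taken for granted:

```latex
% Cramer-Rao bound for an unbiased estimator T of theta
\[
  \operatorname{Var}_\theta(T) \;\ge\; \frac{1}{I(\theta)},
  \qquad
  I(\theta) = E_\theta\!\left[ \left( \frac{\partial}{\partial \theta}
              \log f(Y_1, \dots, Y_n \mid \theta) \right)^{2} \right]
\]
% Example: Y_1, ..., Y_n IID N(theta, sigma^2) with sigma known
\[
  I(\theta) = \frac{n}{\sigma^{2}},
  \qquad
  \operatorname{Var}_\theta(\bar{Y}) = \frac{\sigma^{2}}{n} = \frac{1}{I(\theta)}
\]
```

The sample mean attains the bound here, so it is an MVB unbiased estimator of theta.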
MLEs are introduced not for their own sake but to throw light on confidence intervals. Firstly, the MLE is an estimator and, like any estimator, it is a Pre-Statistic. The way the MLE is connected to confidence intervals is through the fact that, for large samples, the MLE is approximately MVB unbiased and approximately normal. This enables one to form confidence intervals for the parameter by pretending that it has a normal posterior density whose variance is the inverse of the Fisher information. The section then moves on to prove the consistency of the MLE, whereby the MLE converges to the true value of the parameter as the sample size increases. To appreciate the connection between the approximate normality of the MLE and forming confidence bands, the Cauchy distribution is used. The Cauchy distribution has no mean or variance, and hence even for large n the usual methods based on the sample mean are useless. An MLE-based confidence interval can nevertheless be formed for a crazy distribution like the Cauchy.
To prove the consistency of the MLE, the concepts of entropy and relative entropy are introduced. Relative entropy gives a measure of how badly a density f is approximated by a density g. This is also called the Kullback-Leibler relative entropy. The connection between entropy and the normal distribution is shown, and subsequently the importance of the normal distribution in the Central Limit Theorem is highlighted.
The chapter then goes on to cover Bayesian confidence intervals. The basic formula of Bayesian statistics is that the posterior is proportional to the prior times the likelihood. How does one choose priors? What are the different kinds of priors?
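Here is a sketch of an MLE-based interval for the Cauchy location. Several things are my assumptions rather than the book's method: the scale is fixed at 1, the MLE is found by Fisher scoring started at the sample median, and the per-observation Fisher information of 1/2 is the standard value for this model:

```python
import math
import random
import statistics

def cauchy_mle(xs, iterations=50):
    """Fisher scoring for the location of a Cauchy(theta, 1) sample."""
    n = len(xs)
    theta = statistics.median(xs)              # consistent starting point
    for _ in range(iterations):
        # Derivative of the log-likelihood with respect to theta
        score = sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)
        theta += score / (n / 2)               # total Fisher information is n / 2
    return theta

rng = random.Random(5)
true_theta, n = 1.0, 2_000                     # hypothetical values
xs = [true_theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
mle = cauchy_mle(xs)
half_width = 1.96 * math.sqrt(2 / n)           # 1.96 / sqrt(Fisher information)
print(mle - half_width, mle + half_width)
```

The interval width shrinks like 1/sqrt(n) even though the sample mean of the same data never settles down.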
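For discrete distributions the relative entropy is a one-line sum. The sketch below uses the common convention sum of f log(f/g), which may differ in direction or notation from the book's, and the two pmfs are made up for illustration:

```python
import math

def relative_entropy(f, g):
    """Kullback-Leibler relative entropy, sum of f * log(f / g), for discrete pmfs."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

f = [0.5, 0.3, 0.2]          # hypothetical "true" pmf
g = [1 / 3, 1 / 3, 1 / 3]    # hypothetical approximating pmf
print(relative_entropy(f, g), relative_entropy(f, f))
```

The value is zero when g equals f and positive otherwise. That is the property the consistency argument leans on: on average the log-likelihood is largest at the true parameter.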
Improper priors: These priors cannot be normalized to a proper pdf
Reference priors: Priors chosen based on Fisher information
Conjugate priors: The prior is chosen in such a way that the prior times likelihood is again a distribution from the same family as the prior, with different parameters
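The conjugate case is the easiest one to show in code. This sketch uses the standard Beta prior with binomial data; the prior Beta(2, 2) and the counts 7 and 3 are made-up numbers:

```python
from fractions import Fraction

def beta_binomial_update(a, b, successes, failures):
    """Beta(a, b) prior plus binomial data gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

prior = (2, 2)                               # hypothetical Beta(2, 2) prior
posterior = beta_binomial_update(*prior, successes=7, failures=3)
post_mean = Fraction(posterior[0], sum(posterior))
print(posterior, post_mean)
```

The posterior is Beta(9, 5), with mean 9/14: the same family as the prior, with the data simply added to the parameters.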
If one were to make any sensible inferences about parameters, then
Either one must have reasonably good prior information, which one builds into one’s prior
or if one has only vague prior information, one must have an experiment with sample size sufficient to make one’s conclusions rather robust to changes in one’s prior
Whenever one chooses a prior, there is always a possibility that the prior and the likelihood conflict, in the sense that the prior times likelihood is so small everywhere that the posterior density is something close to 0/0. This is the main reason for the lack of robustness. The author gives an overview of hierarchical modeling. He also mentions the famous Cromwell’s dictum and considers an application where a mixture of conjugates is taken and imaginary results are used to select a decent prior. Cromwell’s dictum basically says that one’s prior density should never be literally zero on a theoretically possible range of parameter values. One must regard it as a warning to anyone who chooses a prior density that is too localized and is therefore extremely small on a range of parameter values which we cannot exclude. So, overall, the section on Bayesian confidence intervals compresses the main results of Bayesian statistics into 20-odd pages. It distills almost every main feature of Bayesian stats.
In a book that is ~500 pages long, hypothesis testing takes up ~20 pages, and that too with the title “Hypothesis testing, if you must”. This is mainly to remind readers that hypothesis testing is a tool that has become irrelevant and that the way to report estimates is through confidence intervals.
CIs usually provide a much better way of analyzing data
With this strong advice against using hypothesis testing, the author does give all the main aspects of hypothesis testing. The story as such is straightforward to understand. The MOST important test in the frequentist world is the Likelihood Ratio test. In the frequentist world, the Null and Alternate hypotheses are not placed on an equal footing. The Null is always “innocent unless proven guilty”. The idea behind the LR test is this: divide the parameter space into two, as per the hypotheses. Compute the maximum likelihood of the data over the alternate parameter space and the maximum likelihood of the data over the null parameter space, and take the ratio of the two. Reject the Null if the ratio is high. Sensible approach. But this strategy can reject the Null when in fact the Null is true. So new characters enter the picture. The first is the power function, which gives the probability of landing in the rejection region for a given parameter value. Ideally an experimenter would want the power function to take the value 0 on the null parameter space and 1 on the alternate parameter space. Since this is too much to ask for, he requires the power function to be close to 0 on the null parameter space and close to 1 on the alternate parameter space. Choosing these values for the power function is a way to control Type I and Type II errors. Having done this, the experimenter computes the sample size of the experiment and the critical value of the LR test. There is a nice connection to the chi-squared distribution that helps us quickly get through all this. Deviance is defined to be twice the logarithm of the likelihood ratio, and for large samples it turns out to be approximately chi-squared distributed when the Null is true. Thus one can directly use the deviance to accept or reject the Null. Having given all these relevant details, the author reiterates the point that it is better to cite confidence intervals and HDIs than to do hypothesis testing. In that sense, Bayesian stats is definitely appealing. But the problem with Bayes is priors. Where do we get them? The chapter ends with a brief discussion of model selection criteria.
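A small deviance calculation shows the mechanics. The setting is my own illustration: binomial data with 60 successes in 100 trials, a null value of 0.5 tested against an unrestricted alternative, and the large-sample chi-squared approximation with 1 degree of freedom:

```python
import math

def binomial_deviance(successes, n, p0):
    """Twice the log likelihood ratio for H0: p = p0 against an unrestricted p.

    Assumes 0 < successes < n so that both logarithms are finite.
    """
    p_hat = successes / n
    def loglik(p):
        return successes * math.log(p) + (n - successes) * math.log(1 - p)
    return 2 * (loglik(p_hat) - loglik(p0))

def chi2_1df_tail(x):
    """P(chi-squared with 1 degree of freedom > x), via the error function."""
    return 1 - math.erf(math.sqrt(x / 2))

dev = binomial_deviance(successes=60, n=100, p0=0.5)   # hypothetical data
print(dev, chi2_1df_tail(dev))
```

The deviance is a little over 4, slightly beyond the usual 5% critical value of about 3.84 for one degree of freedom, so the Null would just be rejected at that level.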
Chapter 7: Conditional Pdfs and Multi-Parameter Bayes
The chapter starts off by defining joint pdfs/ conditional pdf/ joint pmf/ conditional pmf . It then shows a geometric derivation of Jacobian. Transformation of random variables is then introduced so that one can calculate the joint density of the transformed variable. One point which is often missed in the intro level probability books is that , when we talk about pdf , it is “a pdf” and not “the pdf” because one can define many pdfs which agree on all points except on a set of measure 0(Subtle point to understand0. For a novice, one might have to go through measure theory to understand. The section then derives Fischer and t distributions using transformation of appropriate random variables. Simulation of values from gamma and t distributions are discussed in detail. The following points summarize the interconnections
Frankly I speed read this part because, well, all it takes to generate a gamma rv is the rgamma function, and for a t rv the rt function, in R. When I first came across pseudo random numbers I was all excited to learn the gory details, but as time passed I have developed an aversion to going into too much depth about these random number generators. As long as there is a standard library that can do the job, I don’t care how it’s done. The chapter then introduces conditional pdfs and marginal pdfs to prepare the ground for multi-parameter Bayesian statistics.
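For what it is worth, the link between the gamma and t distributions is easy to check by simulation. The sketch below is my own illustration, not the book's algorithm. It relies on the standard construction T = Z / sqrt(V / nu), where Z is standard normal and V is chi-squared with nu degrees of freedom, that is, a gamma variable with shape nu/2 and scale 2. It uses Python's standard `random` module rather than R's rgamma and rt, and the helper names are hypothetical.

```python
import math
import random

def draw_t(nu, rng):
    """One draw from a t distribution with nu degrees of freedom,
    built by transforming a normal draw and a gamma draw."""
    z = rng.gauss(0.0, 1.0)
    # Chi-squared with nu df is Gamma(shape=nu/2, scale=2);
    # random.gammavariate takes (shape, scale).
    v = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(v / nu)

def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
nu = 10
draws = [draw_t(nu, rng) for _ in range(20000)]
# For nu > 2 the theoretical variance is nu / (nu - 2).
print("sample variance:", sample_variance(draws),
      "theory:", nu / (nu - 2))
```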
If one needs to estimate mu and sigma for the data, one can form priors for each of these parameters, invoke the likelihood function, and then write down the posterior pdf. However, the expression is rather complicated. Conditional on mu the expression is tractable, and conditional on sigma it is tractable too.
The last section of this chapter is about Multi-Parameter Bayesian Statistics. The content is extremely concise and I found some of it challenging to go over, especially the part that discusses the Bayesian-Frequentist conflict. In a typical posterior density computation, the biggest challenge is the normalizing constant that needs to be tagged along with the prior times the likelihood function. This constant of proportionality is very hard to evaluate, even numerically. This is where Gibbs sampling comes in. It allows one to get to the posterior density by simulation. The idea of the Gibbs sampler is that being given a sample with an empirical distribution close to a desired distribution is nearly as good as being given the distribution itself. In a posterior pdf that usually looks very complicated, if you condition on all the parameters except one, you get a tractable conditional density. So, the logic behind Gibbs sampling is: you condition on all the parameters except the first one and draw a value for the first parameter from its conditional density, then condition on all the parameters except the second one, using the values drawn in the previous steps, and you keep cycling through the parameters until the chain of draws settles down to the posterior distribution. A nice feature of Gibbs sampling is that you can just ignore nuisance parameters in the MCMC chain and focus on whatever parameters you are interested in. The author then introduces WinBUGS to infer the mean and variance of a sample from a normal distribution. Full working code is given in the text for a reader to actually type in the commands in WinBUGS. Of course these examples merely scratch the surface. There are umpteen things to learn to model with WinBUGS and there are obviously entire books written that flesh out the details.
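The Gibbs logic described above can be sketched in a few lines. This is my own minimal illustration in Python, not the WinBUGS code from the book. It assumes a normal model with unknown mean mu and precision tau = 1/sigma^2, a normal prior on mu and a gamma prior on tau. The prior values (m0, p0, a0, b0) and the function name are hypothetical choices for illustration.

```python
import math
import random

def gibbs_normal(data, n_iter=5000, burn_in=1000,
                 m0=0.0, p0=1e-3, a0=1e-3, b0=1e-3, seed=0):
    """Gibbs sampler for y_i ~ N(mu, 1/tau).

    Assumed priors (illustrative only):
      mu  ~ N(m0, 1/p0)
      tau ~ Gamma(shape=a0, rate=b0)
    Returns the list of (mu, tau) draws kept after burn-in.
    """
    rng = random.Random(seed)
    n = len(data)
    ybar = sum(data) / n
    mu, tau = ybar, 1.0  # starting values
    draws = []
    for it in range(n_iter):
        # mu given tau and the data: normal, precision p0 + n * tau
        prec = p0 + n * tau
        mean = (p0 * m0 + tau * n * ybar) / prec
        mu = rng.gauss(mean, 1.0 / math.sqrt(prec))
        # tau given mu and the data: gamma, shape a0 + n/2,
        # rate b0 + SS/2 (gammavariate takes a scale, so invert the rate)
        ss = sum((y - mu) ** 2 for y in data)
        tau = rng.gammavariate(a0 + n / 2.0, 1.0 / (b0 + ss / 2.0))
        if it >= burn_in:
            draws.append((mu, tau))
    return draws

rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]
draws = gibbs_normal(data)
print("posterior mean of mu:", sum(m for m, _ in draws) / len(draws))
print("posterior mean of sigma:",
      sum(t ** -0.5 for _, t in draws) / len(draws))
```

Each sweep draws one parameter from its conditional density given the latest value of the other, which is all that the conditioning argument requires. To treat sigma as a nuisance parameter, simply ignore the second component of each draw.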
The author then quickly mentions the limitations of MCMC:
All MCMC techniques are simulations, not exact answers
Convergence to the posterior distribution can be very slow
The chain can converge to a wrong distribution
The chain can get stuck in a local region of the parameter space
The results might not be robust to changing priors
The author also warns that one should not specify vague priors for hyperparameters, as this can lead to improper posterior densities. There is also a mention of the problem of simultaneous confidence intervals. A confidence region (CR) for a vector parameter has two difficulties. Firstly, it can be rather expensive in computer time and memory to calculate such a CR. Secondly, it is usually difficult to convey the results in a comprehensible way. The section ends with a brief discussion on the Bayesian-Frequentist conflict, which I found hard to understand. Maybe I will revisit this at a later point in time.
Chapter 8: Linear Models and ANOVA
The chapter starts off by explaining the Orthonormality Principle in two dimensions. The Orthonormality Principle marks the perfect tie-up which often exists between likelihood geometry and least-squares methods. Before explaining the math behind the models, the author gives a trailer for all the important models. The trailers contain visuals that attempt to provide motivation to go over the abstract linear algebra setting taken in the chapter. These trailers cover the normal sampling theorem, linear regression and ANOVA. The second section of the chapter gives a crash course on linear algebra principles and then states the Orthonormality Principle for projections. This principle gives us a way to look at the geometry of a linear model. If you are given a basic linear model Y = mu + sigma G, then one way to tie the likelihood geometry and inference together is as follows:
Write down the likelihood function
Under the null hypothesis, the MLE for mu is the projection of Y onto the subspace U associated with the null hypothesis
Compute the LR test statistic
Compute the F statistic
If n is large, the deviance is approximately chi-squared distributed
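The steps above can be sketched for the simplest nested pair of models. This is my own illustration, assuming simple linear regression: the null subspace U is spanned by the constant vector and the larger subspace W is spanned by the constant vector and x. The function names are hypothetical, and the deviance formula n log(RSS_U / RSS_W) is the standard one for a normal linear model with unknown sigma.

```python
import math

def fit_constant(y):
    """Projection of y onto U = span{1}: every fitted value is the mean."""
    m = sum(y) / len(y)
    return [m] * len(y)

def fit_line(x, y):
    """Projection of y onto W = span{1, x}: least-squares fitted values."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [ybar + slope * (xi - xbar) for xi in x]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nested_test(x, y):
    """F statistic and deviance for H0: mu in U against mu in W."""
    n = len(y)
    rss_u = sq_dist(y, fit_constant(y))  # residual sum of squares under H0
    rss_w = sq_dist(y, fit_line(x, y))   # residual sum of squares in W
    dim_u, dim_w = 1, 2
    f_stat = ((rss_u - rss_w) / (dim_w - dim_u)) / (rss_w / (n - dim_w))
    deviance = n * math.log(rss_u / rss_w)  # twice the log likelihood ratio
    return {"rss_u": rss_u, "rss_w": rss_w, "F": f_stat, "deviance": deviance}

x = [1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
print(nested_test(x, y))
```

The geometry shows up as Pythagoras: the residual sum of squares under U equals the residual sum of squares under W plus the squared distance between the two projections, and the F statistic compares those two orthogonal pieces.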
One very appealing feature of this book is that there is Bayesian analysis for all the models. So, one can see the WinBUGS code written by the author for linear regression and ANOVA. It gives the reader a good idea about the flexibility of Bayesian modeling. Even though the mathematics is not as elegant as in the frequentist world, the flexibility of Bayes more than compensates for it. The chapter explains the main techniques for testing goodness of fit, i.e. the qqplot, the KS test (basically a way to quantify the comparison that a qqplot makes visually) and the likelihood-based LR test. The LR test statistic can be approximated by the classic Pearson statistic. Base R allows you to draw a qqplot for Gaussian distributions. For other distributions, one can check the car package, which not only supports different distributions but also draws confidence bands for the observed quantile data. Goodness of fit coupled with parameter estimation is a dicey thing. I had never paid attention to this till date. I was always under the impression that parameter estimation can be done and subsequently goodness of fit can be run based on the estimates. The author clearly gives the rationale for the problems that might arise by combining parameter estimation and goodness of fit testing.
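As a small illustration of what the KS test measures, here is a sketch of the KS statistic, the largest gap between the empirical CDF of the data and a fully specified reference CDF. This is my own example with hypothetical function names, and it only computes the statistic, not a p-value. As the paragraph above cautions, if the reference parameters are estimated from the same data, the usual KS critical values no longer apply.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF and `cdf`.

    The empirical CDF jumps at each sorted observation, so the gap is
    checked just before and just after every jump.
    """
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
print("KS statistic against N(0, 1):", ks_statistic(sample, normal_cdf))
```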
There is a very interesting example that shows the sticking phenomenon: the MCMC chain sticks to certain states and never converges to the right parameter value. The example deals with data from a standardized t distribution, where you have to infer the distribution of the degrees of freedom parameter. Gibbs sampling and WinBUGS do not give the right value of the degrees of freedom parameter. In fact the chain never goes to the state relating to the true parameter. This example has made me curious to look up the literature on ways to solve the "sticking problem" in MCMC methods. The chapter ends with a section on the multivariate normal (MVN) distribution that defines the pdf of the MVN, lists various properties of the MVN, derives the CI for the correlation between bivariate normal RVs, and states the multivariate CLT.
Out of all the chapters in the book, I found this chapter to be the most challenging and demanding. I had to understand Fisherian statistics, likelihood modeling, WinBUGS and Bayesian hierarchical modeling so as to follow the math in this book. In any case I am happy with myself that the effort put into the preparation has paid off well. I have a better understanding of Frequentist and Bayesian modeling after reading through this chapter.
Chapter 9: Some Further Probability
This chapter comprises three sections. The first section is on conditional probability. Standard stuff is covered, like the properties of conditional expectation, applications to branching processes, etc. What I liked about this section is the thorough discussion of the "Two envelopes paradox". The takeaway is to be cautious about the conditional expectation of random variables that are not in L1 space.
The second section is on Martingales. I think this section can form a terrific primer for someone looking to understand Martingales deeply. This can also be a prequel to working through the book "Probability with Martingales" by the same author. The section starts off by defining a martingale and giving a few examples of martingales resulting from a random walk. The Constant Risk Principle is stated as a criterion for checking whether the stopping time is finite or infinite. This is followed by Doob’s Stopping Time Principle (STP). The author states in this context that the STP is one of the most interesting subsections in the book. Indeed it is. Using the STP, many problems become computationally convenient. The highlight of this section is treating a “waiting time for a pattern" problem as a martingale. Subsequently an extension of Doob’s STP, called Doob’s optional sampling theorem, is stated. The author also states the Martingale Convergence theorem, which he considers one of the most important results in mathematics.
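To give a flavour of the waiting-time-for-a-pattern problem, here is a small sketch of my own. It uses the standard "fair casino" martingale argument, which may differ in detail from the book's presentation: for a sequence of independent, uniformly distributed symbols, the expected waiting time for a pattern is the sum of (alphabet size)^k over every k for which the first k symbols of the pattern equal its last k symbols. The function names are hypothetical, and the simulation is only a sanity check.

```python
import random

def expected_wait(pattern, alphabet_size=2):
    """Expected number of draws until `pattern` first appears,
    assuming independent, uniformly distributed symbols."""
    n = len(pattern)
    return sum(alphabet_size ** k
               for k in range(1, n + 1)
               if pattern[:k] == pattern[n - k:])

def simulate_wait(pattern, rng, alphabet="HT"):
    """Draw symbols until the last len(pattern) draws spell `pattern`."""
    window, draws = "", 0
    while window != pattern:
        window = (window + rng.choice(alphabet))[-len(pattern):]
        draws += 1
    return draws

rng = random.Random(0)
runs = 20000
average = sum(simulate_wait("HTH", rng) for _ in range(runs)) / runs
print("formula:", expected_wait("HTH"), "simulated:", average)
```

Overlaps matter: HTH overlaps with itself (its first and last symbols are both H) while HTT does not, so the formula gives a longer expected wait for HTH than for HTT.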
Towards the end of the section, the author proves the Black-Scholes option pricing formula using Martingales. The third section is on Poisson processes. I think this section is basically meant to provide some intuition for the diverse applications of conditional probability models. The last chapter of the book is on Quantum Probability and Quantum Computing, which was too challenging for me to read through.
The author starts off the book with the following statement: “Probability and Statistics used to be married; then they separated; then they got divorced; now they hardly ever see each other. The book is a move towards much needed reconciliation”. The book does a terrific job in rekindling that lost love. This is by far the best book on “probability and statistics” that I have read till date. Awesome work by the humble Professor, David Williams.