Probability


[book cover]

Takeaway:

We see, hear, and talk about “Information” in many contexts. For the last two decades or so, one has even been able to make a career in the field of “Information” technology. But what is “Information”? If someone talks about a certain subject for 10 minutes in English and then for 10 minutes in French, is the “Information” the same in both instances? Can we quantify the two instances in some way? This book explains Claude Shannon’s remarkable achievement of measuring “Information” in terms of probabilities. More than six decades ago, Shannon laid out a mathematical framework, and it was an open challenge for engineers to develop devices and technologies that Shannon had proved to be a “mathematical certainty”. This book distils the main ideas that go into quantifying information with very little math, and hence makes them accessible to a wider audience. A must read if you are curious to know a bit about “Information”, a word that has become part of everyday vocabulary.

[book cover]

 

Takeaway:

I think this book needs to be read after gaining some understanding of the BUGS software and some R/S programming skills. That familiarity can help you simulate and check for yourself the various results and graphs the author uses to illustrate Bayesian concepts. The book starts by explaining the essence of any econometric model and the way in which an econometrician has to put in assumptions to obtain the posterior distributions of the various parameters. The core of the book is covered in three chapters: the first two cover model estimation and model checking, and the fourth chapter of the book covers MCMC techniques. The remaining chapters cover linear models, nonlinear models and time series models. There are two chapters, one on panel data and one on instrumental variables, that are essential for a practicing econometrician tackling the problem of endogenous variables. BUGS code for all the models explained in the book is given in the appendix, so the book can also serve as a quick reference for BUGS syntax. Overall a self-contained book and a perfect one with which to start a journey into Bayesian econometric analysis.

[book cover]

The author is a CS professor at SUNY Stony Brook. This book recounts his experience of building a mathematical system to bet on the outcomes of what is considered the fastest ball game in the world, “Jai alai”. In the English vernacular it is sometimes spelled as it sounds, that is, “hi-li”. The book recounts the history of the game and how it made its way to the US from Spain and France. However, the focus of the book is on using mathematical modeling and computers to analyze the game and design a betting system. The game itself is designed in such a way that it is a textbook case for mathematical analysis: players enter the competition based on a FIFO queue, and the first player to score 7 points is the winner. It takes hardly a few minutes to understand the game from this wiki.

With the help of some of his grad students, the author works on the following questions:

  • Given that a player starts in a specific position, what is the probability that he ends up with a Win/Place/Show? (A simplified simulation sketch follows this list.)
  • What combination of numbers has the highest probability of winning a Trifecta?
  • How does one build a statistical model to evaluate the relative skills of the players?
  • Given that two players A and B have probabilities of winning pA and pB, how does one construct a model that evaluates the probability of A winning over B?
  • How does one create a payoff model for the various bets that are allowed in the game?
  • How do you deal with missing / corrupt data?
  • Given 1) the payoffs of various bets, 2) the probabilities of a player winning from a specific position, and 3) the relative skill sets, how does one combine all of these elements to create a betting strategy?
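
On the first question, here is a minimal Monte Carlo sketch of the queue effect, assuming a deliberately simplified version of the game: eight equally skilled players, the winner of a point stays on court, the loser rejoins the back of the FIFO queue, and the first to 7 points wins. The real Spectacular Seven scoring doubles point values after the first pass through the queue and has tie-break rules for place and show, so treat this as an illustration of the idea, not as Skiena's actual model.

```python
import random
from collections import Counter

def play_game(n_players=8, target=7):
    """One simplified jai alai game between equally skilled players.
    Returns the starting post position (1..n_players) of the winner."""
    queue = list(range(1, n_players + 1))     # FIFO queue of post positions
    scores = {p: 0 for p in queue}
    a, b = queue.pop(0), queue.pop(0)         # positions 1 and 2 start on court
    while True:
        winner, loser = (a, b) if random.random() < 0.5 else (b, a)
        scores[winner] += 1                   # real scoring doubles points later; ignored here
        if scores[winner] >= target:
            return winner
        queue.append(loser)                   # loser goes to the back of the queue
        a, b = winner, queue.pop(0)           # winner stays on, next in line comes up

random.seed(0)
n = 100_000
wins = Counter(play_game() for _ in range(n))
for pos in sorted(wins):
    print(f"post {pos}: P(win) ~ {wins[pos] / n:.3f}")
```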

I have just outlined a few of the questions from the entire book. There are numerous side discussions that make the book a very interesting read. Here is one of the many examples from the book that I found interesting:

Almost every person who learns to do simulation comes across the linear congruential generator (LCG), one of the basic number-theoretic techniques for generating pseudo-random numbers. It has the following recursion form: X(k+1) = (a*X(k) + c) mod n.

By choosing appropriate values for a, c and n, one can generate pseudo random numbers.

The book connects the above recursive form to a roulette wheel:

Why do casinos and their patrons trust that roulette wheels generate random numbers? Why can’t the fellow in charge of rolling the ball learn to throw it so it always lands in the double-zero slot? The reason is that the ball always travels a very long path around the edge of the wheel before falling, but the final slot depends upon the exact length of the entire path. Even a very slight difference in initial ball speed means the ball will land in a completely different slot.

So how can we exploit this idea to generate pseudorandom numbers? A big number (corresponding to the circumference of the wheel) times a big number (the number of trips made around the wheel before the ball comes to rest) yields a very big number (the total distance that the ball travels). Adding this distance to the starting point (the release point of the ball) determines exactly where the ball will end up. Taking the remainder of this total with respect to the wheel circumference determines the final position of the ball by subtracting all the loops made around the wheel by the ball.

The above analogy makes the appearance of the mod operator in the LCG equation obvious.
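
As a quick illustration of the recursion, here is a minimal LCG in Python. The parameter values below (a = 1103515245, c = 12345, n = 2^31) are the ones often quoted for C's rand() and are used here only as an example; the point is the multiply-add-mod structure and the roulette analogy above.

```python
def lcg(seed, a=1103515245, c=12345, n=2**31):
    """Yield pseudo-random integers via X(k+1) = (a*X(k) + c) mod n."""
    x = seed
    while True:
        x = (a * x + c) % n        # 'mod n' subtracts all the full loops around the wheel
        yield x

gen = lcg(seed=42)
print([next(gen) for _ in range(5)])            # raw integers in [0, n)
print([next(gen) / 2**31 for _ in range(3)])    # scaled to [0, 1)
```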

One does not need to know much about Jai alai to appreciate the modeling aspects of the game and the statistical techniques mentioned in the book. In fact, this book is a classic story of how one goes about modeling a real-life scenario and profiting from it.

[book cover]

With total silence around me and my mind wanting to immerse in a book, I picked up this book from my inventory. I came across a reference to this work in Aaron Brown’s book on Risk Management.

First something about the cover:

The young woman on the right is the classical Goddess Fortuna, whom today we might call Lady Luck. The young man on the left is Chance. Fortuna is holding an enormous bunch of fruits, symbolizing the good luck that she can bring. But notice that she has only one sandal. That means that she can also bring bad luck. And she is sitting on a soap bubble! This is to indicate that what you get from luck does not last. Chance is holding lottery tickets. Dosso Dossi was a court painter in the northern Italian city of Ferrara, which is near Venice. Venice had recently introduced a state lottery to raise money. It was not so different from modern state-run lotteries, except that Venice gave you better odds than any state-run lottery today. Art critics say that Dosso Dossi believed that life is a lottery for everyone. Do you agree that life is a lottery for everyone? The painting is in the J. Paul Getty Museum, Los Angeles, and the above note is adapted from notes for a Dossi exhibit, 1999.

The book starts with a set of 7 questions, and it is suggested that readers solve them before proceeding with the book.

Logic

The first chapter deals with some basic terminology that logicians use. The following terms are defined and examples are given to explain each of them in detail:

  • Argument: A point or series of reasons presented to support a proposition which is the conclusion of the argument.
  • Premises + Conclusion: An argument can be divided in to premises and a conclusion.
  • Propositions: Premises and conclusion are propositions, statements that can be either true or false.
  • Validity of an argument: Validity has to do with the logical connection between premises and conclusion, and not with the truth of the premises or the conclusion. An argument is invalid if the conclusion can be false even when all the premises are true.
  • Soundness of an argument: Soundness for deductive logic has to do with both validity and the truth of the premises.
  • Validity vs. Truth: Validity is not truth. Logic takes the premises as given and checks whether the conclusion follows from them. Even if the premises are false, the reasoning can still be valid, though the conclusion need not be true.

Logic is concerned only with the reasoning. Given the premises, it can tell you whether the conclusion is valid or not. It cannot say anything about the veracity of the premises. Hence there are two ways to criticize a deduction: 1) A premise is false, 2) The argument is invalid. So there is a division of labor. Who is an expert on the truth of premises? Detectives, nurses, surgeons, pollsters, historians, astrologers, zoologists, investigative reporters, you and me. Who is an expert on validity? A logician.

The takeaway of the chapter is that valid arguments are risk-free arguments: given true premises, a valid argument cannot lead you to a false conclusion.

Inductive Logic

The chapter introduces risky arguments and inductive logic as a mechanism for reasoning about them. Valid arguments are risk-free arguments. A risky argument is one that is very good, yet its conclusion can be false even when the premises are true. Inductive logic studies risky arguments. There are many forms of risky argument: making a statement about a population from a statement about a sample, making a statement about a sample from a statement about the population, making a statement about one sample based on a statement about another sample, and so on. Not all of these can be studied via inductive logic, and there may be more to risky arguments than inductive logic: inductive logic does study risky arguments, but maybe not every kind of risky argument. The terms introduced in this chapter are

  • Inference to the best explanation
  • Risky Argument
  • Inductive Logic
  • Testimony
  • Decision theory

The takeaway of the chapter is that Inductive logic analyzes risky arguments using probability ideas.

The Gambler’s fallacy

This chapter talks about the gambler’s fallacy: a gambler justifies betting on a red slot of the roulette wheel given that the last X outcomes on the wheel have been black. His premise is that the wheel is fair, but his action goes against that premise, since he is implicitly questioning the independence of the outcomes. Informal definitions are given for bias, randomness, complexity and no regularity. Serious thinking about risks, which uses probability models, can go wrong in two very different ways: 1) the model may not represent reality well, which is a mistake about the real world; 2) we can draw wrong conclusions from the model, which is a logical error. Criticizing the model is like challenging the premises; criticizing the analysis of the model is like challenging the reasoning.
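
A quick simulation (my own illustration, not from the book) makes the fallacy concrete: on a fair wheel, the chance of red conditional on the last five spins having been black is no different from the unconditional chance of red.

```python
import random

random.seed(0)
wheel = ["red"] * 18 + ["black"] * 18 + ["green"]   # fair European-style wheel
spins = [random.choice(wheel) for _ in range(1_000_000)]

p_red = sum(s == "red" for s in spins) / len(spins)

# Look only at spins that followed a run of five blacks.
after_black_run = [spins[i] for i in range(5, len(spins))
                   if all(s == "black" for s in spins[i - 5:i])]
p_red_given_run = sum(s == "red" for s in after_black_run) / len(after_black_run)

print(f"P(red)                     ~ {p_red:.4f}")
print(f"P(red | last 5 were black) ~ {p_red_given_run:.4f}")
```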

Elementary Probability Ideas

This chapter introduces some basic ideas of events, ways to compute probability of compound events etc. The chapter also gives an idea of the different terminologies used by statisticians and logicians, though they mean the same thing. Logicians are interested in arguments that go from premises to conclusions. Premises and conclusions are propositions. So, inductive logic textbooks usually talk about the probability of propositions. Most statisticians and most textbooks on probability talk about the probability of events. So there are two languages of probability. Why learn two languages when one will do? Because some students will talk the event language, and others will talk the proposition language. Some students will go on to learn more statistics, and talk the event language. Other students will follow logic, and talk the proposition language. The important thing is to be able to understand anyone who has something useful to say.

Conditional Probability

This chapter gives formulae for computing conditional probabilities. All the conditioning is done for a discrete random variable. Anything more sophisticated than a discrete RV would have alienated non-math readers of the book. A few examples are given to solidify the notions of conditional probability.

The Basic Rules of Probability & Bayes Rule

Rules of probability such as normality, additivity, total probability and statistical independence are explained via visuals. I think this chapter and the previous three are geared towards a reader who is a total novice in probability theory. The book also gives an intuition into Bayes’ rule using elementary examples that anyone can understand. Concepts such as reliability testing are also discussed.

How to combine Probabilities and Utilities?

There are three chapters under this section. The chapter on expected value introduces a measure of the utility of a consequence and explores various lottery situations to show that the cards are stacked against every lottery buyer and that the lottery owner always holds an edge. The chapter on maximizing expected value says that one way to choose among a set of actions is to choose the one that gives the highest expected value. To compute the expected value, one has to represent degrees of belief by probabilities and the consequences of actions by utiles (which can be converted into equivalent monetary units). Despite the obviousness of the expected value rule, there are a few paradoxes, and those are explored in the chapter; the popular one covered is the Allais paradox. All these paradoxes carry a common message: the expected value rule does not factor in attitudes such as risk aversion and other behavioral biases, and might just be a way to define utilities in the first place. So the expected value rule is not as watertight as it might seem. There are also situations where decision theory cannot be of help. One may disagree about the probability of the consequences; one may also disagree about the utilities (how dangerous or desirable the consequences are). Often there is disagreement about both probability and utility. Decision theory cannot settle such disagreements, but at least it can analyze the disagreement so that both parties can see what they are arguing about. The last chapter in this section deals with decision theory. The three decision rules explained in the chapter are 1) the dominance rule, 2) the expected value rule, and 3) the dominant expected value rule. Pascal’s wager is introduced to explain the three rules. The basic framework is to come up with a partition of the possible states of affairs, the possible acts an agent can undertake, and the utilities of the consequences of each possible act in each possible state of affairs in the partition.
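
As a tiny worked example of the expected value rule (with made-up numbers, not figures from the book): a lottery ticket that costs 2 units and pays a million with probability one in ten million has a negative expected value, which is the sense in which the cards are stacked against the buyer.

```python
# Hypothetical lottery: ticket costs 2 units, pays 1,000,000 with probability 1 in 10,000,000.
p_win = 1 / 10_000_000
prize = 1_000_000
ticket_price = 2

expected_value = p_win * prize - ticket_price
print(f"expected value per ticket: {expected_value:.2f} units")   # -1.90: the owner keeps the edge
```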

Kinds of Probability

What do you mean?

This chapter brings out the real meaning of the word “probability” and is probably :) the most important chapter of the book.

  1. This coin is biased toward heads. The probability of getting heads is about 0.6.
  2. It is probable that the dinosaurs were made extinct by a giant asteroid hitting the Earth.
    a. The probability that the dinosaurs were made extinct by a giant asteroid hitting the Earth is very high, about 0.9.
  3. Taking all the evidence into consideration, the probability that the dinosaurs were made extinct by a giant asteroid hitting the Earth is about 90%.
  4. The dinosaurs were made extinct by a giant asteroid hitting the Earth.

Statements (1) and (4) [but not (3)] are similar in one respect. Statement (4), like (1), is either true or false, regardless of what we know about the dinosaurs. If (4) is true, it is because of how the world is, especially what happened at the end of the dinosaur era. If (3) is true, it is not true because of “how the world is,” but because of how well the evidence supports statement (4). If (3) is true, it is because of inductive logic, not because of how the world is. The evidence mentioned in (3) will go back to laws of physics (iridium), geology (the asteroid), geophysics, climatology, and biology. But these special sciences do not explain why (3) is true. Statement (3) states a relation between the evidence provided by these special sciences and statement (4), about dinosaurs. We cannot do experiments to test (3). Notice that the tests of (1) may involve repeated tosses of the coin, but it makes no sense at all to talk about repeatedly testing (3). Statement (2.a) is different from (3), because it does not mention evidence. Unfortunately, there are at least two ways to understand (2.a). When people say that so-and-so is probable, they may mean that relative to the available evidence so-and-so is probable; this is the interpersonal/evidential reading. The other way to understand (2.a) is in terms of a personal sense of belief.

Statement (4) was a proposition about dinosaur extinction; (2) and (3) are about how credible (believable) (4) is. They are about the degree to which someone believes, or should believe, (4). They are about how confident one can or should be in the light of that evidence. The use of the word probability in statements (2) and (3) is related to ideas such as belief, credibility, confidence and evidence, and the general name used to describe them is “belief-type probability”.

In contrast, the truth of statement (1) seems to have nothing to do with what we believe. We seem to be making a completely factual statement about a material object, namely the coin (and the device for tossing it). We could simply be wrong, whether we know it or not: this might be a fair coin, and we may simply have been misled by the small number of times we tossed it. We are talking about a physical property of the coin, which can be investigated by experiment. The use of probability in (1) is related to ideas such as frequency, propensity and disposition, and the general name used to describe these is “frequency-type probability”.

Belief-type probabilities have been called “epistemic”— from episteme, a Greek word for knowledge. Frequency-type probabilities have been called “aleatory,” from alea, a Latin word for games of chance, which provide clear examples of frequency-type probabilities. These words have never caught on. And it is much easier for most of us to remember plain English words rather than fancy Greek and Latin ones.

Frequency-type probability statements state how the world is. They state, for example, a physical property about a coin and tossing device, or the production practices of Acme and Bolt. Belief-type probability statements express a person’s confidence in a belief, or state the credibility of a conjecture or proposition in the light of evidence.

The takeaway from the chapter is that any statement involving the word probability carries one of two types of meaning, belief-type or frequency-type. It is important to understand which type of probability is being talked about in any given statement.

Theories about Probability

The chapter describes four theories of probability:

  1. Belief type – Personal Probability
  2. Belief type – Logical Probability – Interpersonal /Evidential probability
  3. Frequency type – Limiting frequency based
  4. Frequency type – Propensity based

Probability as Measure of Belief

Personal Probabilities

This chapter explains the way in which degrees of belief can be represented as betting rates or odds. Let’s say my friend and I enter into a bet about an event A, say, “India wins the next cricket world cup”. If I think that India is 3 times more likely to win than to lose (i.e. my personal probability for A is 3/4), then to translate this belief into a bet I would invite my friend to take part in a bet where the total stake is Rs 4000. A bet ON the event is fair by my lights if I stake Rs 3000 on it and my friend stakes Rs 1000 against it: my expected payoff is (1000*3/4)+(-3000*1/4) = 0, and my friend’s expected payoff is (-1000*3/4)+(3000*1/4) = 0. Equally, a bet AGAINST the event is fair by my lights if I stake Rs 1000 against it and my friend stakes Rs 3000 on it: my expected payoff is then (-1000*3/4)+(3000*1/4) = 0. By agreeing to place a bet on or against the event at these rates, I am quantifying MY degree of belief as a betting fraction, i.e. my stake divided by the total stake.

It is important to note that this might not be a fair bet according to my FRIEND’s belief system. He might think that the event “India wins the next cricket world cup” has a 50/50 chance. In that case, if my friend’s belief is right, he has an edge betting against the event and is at a disadvantage betting for it. Why? In the former case his expected payoff would be (-1000*1/2)+(3000*1/2) > 0, and in the latter case it would be (1000*1/2)+(-3000*1/2) < 0. As you can see, a bet being struck means that the terms at least match the belief system of one of the two players. Generalizing this to a market where investors buy and sell securities and there is a market maker, you get the picture that placing bets on securities is an act of quantifying the implicit belief systems of the investors. A bookmaker / market maker never quotes fair bets; he always adds a margin that keeps him safe, i.e. ensures he doesn’t go bankrupt. The first example of this that I came across in the context of pricing financial derivatives was in the book by Baxter and Rennie; their introductory comments describing arbitrage pricing and expectation pricing set the tone for a beautiful adventure of reading that book.
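
The arithmetic in the two paragraphs above is easy to check. The sketch below evaluates both sides of the Rs 4000 bet, first under my probability of 3/4 and then under my friend's probability of 1/2 (the numbers are exactly the ones from the example, nothing more).

```python
def expected_payoff(p_event, stake_on, stake_against, side):
    """Expected payoff for one bettor: the 'on' bettor risks stake_on to win
    stake_against; the 'against' bettor risks stake_against to win stake_on."""
    if side == "on":
        return p_event * stake_against - (1 - p_event) * stake_on
    return (1 - p_event) * stake_on - p_event * stake_against

# Total stake Rs 4000: Rs 3000 on the event versus Rs 1000 against it.
for who, p in [("me (p=3/4)", 0.75), ("friend (p=1/2)", 0.50)]:
    on = expected_payoff(p, stake_on=3000, stake_against=1000, side="on")
    against = expected_payoff(p, stake_on=3000, stake_against=1000, side="against")
    print(f"{who:15s} ON: {on:+.0f}   AGAINST: {against:+.0f}")
```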

The takeaways of this chapter are: 1) belief cannot be measured exactly, and 2) you can think of artificial randomizers to calibrate degrees of belief.

Coherence

This chapter argues that betting rates ought to satisfy the basic rules of probability. There are three steps to the argument:

  1. Personal degrees of belief can be represented by betting rates.
  2. Personal betting rates should be coherent.
  3. A set of betting rates is coherent if and only if it satisfies the basic rules of probability.

Via examples, the chapter shows that any inconsistency in the odds a person quotes for and against events opens that person up to a sure-loss gamble, i.e. an arbitrage. Hence the betting fractions, or the odds, should satisfy the basic rules of probability.
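
Here is a bare-bones illustration of the sure-loss argument, with made-up numbers: suppose someone incoherently quotes betting rates of 0.6 on an event A and 0.6 on not-A (the rates sum to more than 1). If they buy, at their own quoted rates, a unit ticket on each, they lose money whatever happens.

```python
def net_for_incoherent_bettor(rate_A, rate_not_A, A_happens):
    """Net result for someone who buys, at their own quoted rates, a ticket
    paying 1 if A occurs and a ticket paying 1 if not-A occurs."""
    cost = rate_A + rate_not_A                                   # price paid for both tickets
    payout = (1 if A_happens else 0) + (0 if A_happens else 1)   # exactly one ticket pays
    return payout - cost

# Incoherent rates: P(A) = 0.6 and P(not-A) = 0.6 sum to 1.2 > 1.
for outcome in (True, False):
    print(f"A happens: {outcome!s:5s}  net: {net_for_incoherent_bettor(0.6, 0.6, outcome):+.2f}")
```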

The first systematic theory of personal probability was presented in 1926 by F. P. Ramsey, in a talk he gave to a philosophy club in Cambridge, England. He mentioned that if your betting rates don’t satisfy the basic rules of probability, then you are open to a sure-loss contract. But he had a much more profound— and difficult— argument that personal degrees of belief should satisfy the probability rules. In 1930, another young man, the Italian mathematician Bruno de Finetti, independently pioneered the theory of personal probability. He invented the word “coherence,” and did make considerable use of the sure-loss argument.

Learning from Experience

This chapter talks about the application of Bayes’ rule: it is basically a way to combine a personal probability with evidence to get an updated personal probability. The theory of personal probability was independently invented by Frank Ramsey and Bruno de Finetti, but the credit for the idea, and the very name “personal probability”, goes to the American statistician L. J. Savage (1917-1971). He clarified the idea of personal probability and combined it with Bayes’ Rule. The chapter also talks about the contributions of various statisticians and scientists such as Richard Jeffrey, Harold Jeffreys, Rudolf Carnap, L. J. Savage and I. J. Good.
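
A bare-bones sketch of "learning from experience" with Bayes' Rule, using a made-up example: two hypotheses about a coin (fair, or biased 0.8 towards heads) start with equal prior degrees of belief and get updated after each observed toss.

```python
def bayes_update(prior, likelihood):
    """One application of Bayes' Rule: posterior is proportional to prior x likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

p_heads = {"fair": 0.5, "biased": 0.8}          # the two hypotheses about the coin
belief = {"fair": 0.5, "biased": 0.5}           # prior degrees of belief

for toss in "HHTHH":                            # the observed evidence, toss by toss
    likelihood = {h: p_heads[h] if toss == "H" else 1 - p_heads[h] for h in p_heads}
    belief = bayes_update(belief, likelihood)
    print(toss, {h: round(p, 3) for h, p in belief.items()})
```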

Probability as Frequency

The four chapters under this section explore frequentist ideas. It starts off by describing some deductive connections between probability rules and our intuitions about stable frequencies. Subsequently, a core idea of frequency-type inductive inference— the significance idea is presented. The last chapter in the section presents a second core idea of frequency-type inductive inference— the confidence idea. This idea explains the way opinion polls are now reported. It also explains how we can think of the use of statistics as inductive behavior. Basically all the chapters give a crash course on classical statistics without too much of math.

Probability applied to Philosophy

The book introduces David Hume’s idea that there is no justification for inductive inferences. Karl Popper, another philosopher, agreed with Hume but held that it doesn’t matter, since good reasoning never needs induction. According to Popper, “The only good reasoning is deductively valid reasoning. And that is all we need in order to get around in the world or do science”. There are two chapters that talk about evading Hume’s problem, one via the Bayesian evasion (which argues that Bayes’ Rule shows us the rational way to learn from experience) and the other via the behavior evasion (which argues that although there is no justification for any individual inductive inference, there is still a justification for inductive behavior).

The Bayesian’s response to Hume is :

Hume, you’re right. Given a set of premises, supposed to be all the reasons bearing on a conclusion, you can form any opinion you like. But you’re not addressing the issue that concerns us! At any point in our grown-up lives (let’s leave babies out of this), we have a lot of opinions and various degrees of belief about our opinions. The question is not whether these opinions are “rational.” The question is whether we are reasonable in modifying these opinions in the light of new experience, new evidence. That is where the theory of personal probability comes in. On pain of incoherence, we should always have a belief structure that satisfies the probability axioms. That means that there is a uniquely reasonable way to learn from experience— using Bayes’ Rule.

The Bayesian evades Hume’s problem by saying that Hume is right. But, continues the Bayesian, all we need is a model of reasonable change in belief. That is sufficient for us to be rational agents in a changing world.

The frequentist response to Hume is:

We do our work in two steps: 1) actively interfering in the course of nature, using a randomized experimental design; 2) using a method of inference which is right most of the time, say, 95% of the time. The frequentist says: “Hume, you are right, I do not have reasons for believing any one conclusion. But I have a reason for using my method of inference, namely that it is right most of the time.”

The chapter ends with a single-case objection and discusses the arguments used by Charles Sanders Peirce. In essence, the chapters under this section point to Peirce’s conclusion:

  • An argument form is deductively valid if the conclusion of an argument of such a form is always true when the premises are true.
  • An argument form is inductively good if the conclusion of an argument of such a form is usually true when the premises are true.
  • An argument form is inductively 95% good if the conclusion of an argument of such a form is true in 95% of the cases where the premises are true.

 

Takeaway:

The field of probability was not discovered; rather, it was created by the confusion of two concepts. The first is the frequency with which certain events recur, and the second is the degree of belief to attach to a proposition. If you want to understand these two schools of thought from a logician’s perspective and get a grasp on the various philosophical takes on the word “probability”, then this book is a suitable text, as it gives a thorough exposition without too much math.

[book cover]

This book is about a set of letters exchanged between Pascal and Fermat in the year 1654 that led to a completely different way of looking at the future. The main content of the letters revolves around solving a particular problem, called the “problem of points”. A simpler version of the problem goes like this:

Suppose two players—call them Blaise and Pierre—place equal bets on who will win the best of five tosses of a fair coin. They start the game, but then have to stop before either player has won. How do they divide the pot? If each has won one toss when the game is abandoned after two throws, then clearly, they split the pot evenly, and if they abandon the game after four tosses when each has won twice, they do likewise. But what if they stop after three tosses, with one player ahead 2 to 1?

It is not known how many letters were exchanged between Pascal and Fermat to solve this problem, but the entire correspondence took place in 1654. By the end of it, Pascal and Fermat had managed to do what was unthinkable until then: “predict the future”, and more importantly, act based on predictions of the future.

Pascal tried to solve the problem using recursion, whereas Fermat did it in a simpler way, i.e. by enumerating the future outcomes had the game continued. The solution gave rise to a new way of thinking, and it is said that this correspondence marked the birth of risk management as we know it today.
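
Fermat's enumeration is easy to reproduce in a few lines: list every equally likely way the remaining tosses could go, as if the game had continued, and count the fraction of futures in which each player reaches three wins. For the 2-1 situation described above this gives the classic 3:1 split.

```python
from itertools import product

def split_pot(wins_a, wins_b, target=3):
    """Fermat's method for the problem of points: enumerate the remaining tosses
    as if the game continued and count each player's share of the futures."""
    remaining = (target - wins_a) + (target - wins_b) - 1   # tosses that could still matter
    a_futures = sum(wins_a + f.count("A") >= target
                    for f in product("AB", repeat=remaining))
    total = 2 ** remaining
    return a_futures / total, (total - a_futures) / total

print(split_pot(2, 1))   # game abandoned at 2-1 in a best of five -> (0.75, 0.25)
```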

The book is not so much an analysis of the solution (the author believes that today anyone who has had just a few hours of instruction in probability theory can solve the problem of the points with ease) as an account of the developments leading up to 1654 and the developments after it. In the process, the book recounts all the important personalities who played a role in turning probability from a gut-based discipline into a rigorous mathematical discipline. The book can easily be read in an hour’s time and could have been a blog post.

[book cover]

I had been intending to read this book for many months but somehow never had a chance to go over it. Unfortunately I fell sick this week and lacked strength to do my regular work. Fortunately I stumbled on to this book again. So, I picked it up and read it cover to cover while still getting over my illness.

A one-phrase summary of the book is “Develop Bayesian thinking”. The book is a call to arms for acknowledging our failures in prediction and doing something about them. To paraphrase the author,

We have a prediction problem. We love to predict things and we aren’t good at it

This is the age of “Big Data”, and there seems to be a line of thought that you don’t need models anymore since you have the entire population with you; the data will tell you everything. Well, if one looks at the classical theory of statistics, where the only form of error one deals with is sampling error, then the argument might make sense. But the author warns against this kind of thinking, saying that the more the data, the more the false positives. Indeed, most of the statistical procedures that one comes across at the undergrad level are heavily frequentist in nature. They were relevant to an era where sparse data needed heavily assumption-laden models. But with huge data sets, who needs models or estimates? The flip side is that many models fit the data you have, so the noise level explodes and it becomes difficult to cull the signal out of the noise. The evolutionary software installed in the human brain is such that we all love prediction, and there are a ton of fields where it has failed completely. The author analyzes some domains where predictions have failed and some domains where predictions have worked, and thus gives a nice compare-and-contrast insight into the reasons for predictive efficiency. If you are a reader who has never been exposed to Bayesian thinking, my guess is that by the end of the book you will walk away convinced that Bayes is the way to go, or at least that Bayesian thinking is a valuable addition to your thinking toolkit.

The book is organized into 13 chapters. The first seven chapters diagnose the prediction problem and the last six explore and apply Bayes’s solution. The author urges the reader to think about the following issues while reading through the various chapters:

  • How can we apply our judgment to the data without succumbing to our biases?
  • When does market competition make forecasts better- and how can it make them worse?
  • How do we reconcile the need to use the past as a guide with our recognition that the future may be different?


A Catastrophic failure of prediction (Recession Prediction)

The financial crisis has led to a boom in one field: books on the financial crisis. Since the magnitude of the impact was so large, everybody had something to say. In fact, during the first few months post-2008, I read at least half a dozen books and then gave up when every author came up with almost the same reasons for why it happened. There was nothing to read but books on the crisis; some authors even wrote their books like crime thrillers. In this chapter, the author comes up with almost the same reasons for the crisis that one has been bombarded with earlier:

  • Homeowners thought their house prices would go up year after year.
  • Rating agencies had faulty models with faulty risk assumptions.
  • Wall Street took massive leveraged bets on the housing sector, and the housing crisis turned into a financial crisis.
  • Post crisis, there was a failure to predict the nature and extent of various economic problems.

However, the author makes a crucial point: in all of these cases, the predictions were made “out of sample”. This is where he starts making sense.

  • IF the homeowners had had a prior that house prices might fall, they would have behaved differently.
  • IF the models had had some prior on correlated default behavior, the models would have brought some sanity into the valuations.
  • IF Wall Street had had Bayesian risk pricing, the crisis would have been less harsh.
  • IF the post-crisis scenarios had had sensible priors for forecasting employment rates etc., then policy makers would have been more prudent.

As you can see, there is a big “IF”, which is usually a casualty when emotions run wild, when personal and professional incentives are misaligned, and when there is a gap between what we know and what we think we know. All these conditions can be moderated by an attitudinal shift towards Bayesian thinking. Probably the author starts the chapter with this recent incident to show that our prediction problems can have disastrous consequences.

Are you smarter than a Television Pundit? (Election Result Prediction)

How does Nate Silver crack the forecasting problem? This chapter gives a brief intro to Philip Tetlock’s study, in which he found that hedgehogs fared worse than foxes. There is an interesting book titled Future Babble that takes a detailed look at Tetlock’s study and makes for quite an interesting read. Nate Silver gives three reasons why he has succeeded with his predictions:

  • Think Probabilistically
  • Update your Probabilities
  • Look for Consensus

If you read them from a stats perspective, the above three reasons are nothing but: form a prior, update the prior, and create a forecast based on the prior and other qualitative factors. The author makes a very important distinction between “objective” and “quantitative”. Often one wants to be the former but ends up being the latter. Quantitative analysis gives us many options depending on how the numbers are made to look: a statement on one time scale can look completely different on another time scale. “Objective” means seeing beyond our personal biases and prejudices and seeing the truth, or at least attempting to see the truth. Hedgehogs by their very nature stick to one grand theory of the universe and selectively pick things that confirm their theory. In the long run they lose out to foxes, who are adaptive in nature, update their probabilities, and do not fear saying that they don’t know something or that they can only make a statement with wide variability.

I have seen this hedgehog vs. fox analogy in many contexts. Riccardo Rebonato has written an entire book about it, arguing that volatility forecasting should be done like a fox rather than a hedgehog. In fact, one of the professors at NYU said the same thing to me years ago: “You don’t need a PhD to do well in quant finance. You need to be like a fox, comfortable with alternative hypotheses for a problem. Nobody cares whether you have a grand theory for success in trading or not. The only thing that matters is whether you are able to adapt quickly or not.”

One thing this chapter made me think about was the horde of equity research analysts on Wall Street, Dalal Street and everywhere else. How many of them have a Bayesian model of the securities they are investing in? How many of them truly update their probabilities based on the new information that flows into the market? Do they simulate various scenarios? Do they actively discuss priors and the various assigned probabilities? I don’t know. My guess, however, is that only a few do, as most of the research reports that come out contain stories, yarns spun around various news items: terrific after-the-fact analysis but terrible before-the-fact statements.

All I care about is W’s and L’s (Baseball Player Performance Prediction)

If you are not a baseball fan but have managed to read “Moneyball” or watched the movie of the same name starring Brad Pitt, you know that baseball as a sport has been revolutionized by stats geeks. In the Moneyball era, insiders might have hypothesized that stats would completely displace scouts, but that never happened; in fact, Billy Beane expanded the scouting team of the Oakland A’s. It is easy to get sucked into some tool that promises to be the perfect oracle. The author narrates his experience of building one such tool, PECOTA. PECOTA crunched out similarity scores between baseball players using a nearest-neighbour algorithm, the first kind of algorithm you learn in any machine learning course. Despite its success, he is quick to caution that it is not prudent to limit oneself to gathering only quantitative information. It is always better to figure out processes to weigh the new information. In a way, this chapter says that one cannot be blinded by a tool or a statistical technique; one must always weigh every piece of information that comes into the context and update the relevant probabilities.

The key is to develop tools and habits so that you are more often looking for ideas and information in the right places – and in honing the skills required to harness them in to wins and losses once you have found them. It’s hard work.(Who said forecasting isn’t?)
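
The nearest-neighbour idea mentioned above is simple to sketch. The snippet below is not PECOTA's actual similarity score (which weights many attributes and career trajectories); it is just the bare idea of representing each player as a vector of statistics and ranking comparables by distance. The stat lines are made up.

```python
import numpy as np

# Made-up player stat lines: [batting average, home runs, age].
players = {
    "player_a": np.array([0.280, 22, 27]),
    "player_b": np.array([0.275, 25, 28]),
    "player_c": np.array([0.240, 35, 31]),
}
target = np.array([0.278, 23, 27])   # the player we want comparables for

# In practice each column would be standardized first so no single stat dominates.
ranked = sorted(players.items(), key=lambda kv: float(np.linalg.norm(kv[1] - target)))
for name, stats in ranked:
    print(name, round(float(np.linalg.norm(stats - target)), 2))
```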

For Years You have been telling us that Rain is Green (Weather Prediction)

This chapter talks about one of the success stories in the prediction business: weather forecasting. The National Hurricane Center predicted Katrina five days before the levees were breached, a kind of prediction that was unthinkable 20-30 years back. The chapter says that weather predictions have become 350% more accurate in the past 25 years alone.

The first attempt at weather forecasting was made by Lewis Fry Richardson in 1916. He divided the land into a grid of squares and then used the local temperature, pressure and wind speeds to forecast the weather in each cell. Note that this method was not probabilistic in nature; instead, it was based on first principles, taking advantage of a theoretical understanding of how the system works. Despite the seemingly commonsensical approach, Richardson’s method failed, one reason being that it required an awful lot of manual computation. By 1950, John von Neumann and his team had made the first computer forecast using the grid approach. Even with a computer, the forecasts were not good, because weather is multidimensional in nature and analyzing it on a 2D grid was bound to fail. Once you increase the dimensions of the analysis, the calculations explode, so one might think that with the exponential rise in computing power, weather forecasting would be a solved problem by now. However, there is one thorn in the flesh: the initial conditions. Courtesy of chaos theory, a mild change in the initial conditions gives rise to a completely different forecast for a given region. This is where probability comes in. Meteorologists run ensembles of simulations and report the findings probabilistically: when someone says there is a 30% chance of rain, it basically means that 30% of their simulations showed a possibility of rain. Despite this problem of initial conditions, weather forecasting and hurricane forecasting have vastly improved in the last two decades or so. Why? The author tours the World Weather Building in Maryland and explains the role of human eyes in detecting patterns in the weather.
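
The "sensitive to initial conditions, so run an ensemble" idea can be shown with a toy chaotic system (my own illustration; real weather models are vastly more complex). Tiny perturbations of the starting value of the chaotic logistic map lead to very different states after 50 steps, and reporting the fraction of ensemble members above a threshold is the spirit of "a 30% chance of rain".

```python
import random

def iterate(x, steps=50, r=3.9):
    """Iterate the chaotic logistic map x -> r*x*(1-x)."""
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

random.seed(1)
base = 0.2                                    # the 'observed' initial condition
ensemble = [iterate(base + random.uniform(-1e-4, 1e-4)) for _ in range(1000)]

chance = sum(x > 0.5 for x in ensemble) / len(ensemble)
print(f"fraction of ensemble members above the threshold: {chance:.0%}")
```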

In any basic course on stats, a healthy sense of skepticism towards the human eye is drilled into students: one typically comes across the statement that human eyes are not all that good at picking out statistically important patterns, i.e. at separating signal from noise. However, in the case of weather forecasting there seems to be tremendous value in human eyes. The best forecasters need to think visually and abstractly while at the same time being able to sort through the abundance of information the computer provides them with.

Desperately Seeking Signal (Earthquake Prediction)

The author takes the reader into the world of earthquake prediction. An earthquake occurs when stress is released along one of a multitude of fault lines. The only well-established relationship is the Gutenberg-Richter law, under which the frequency of earthquakes falls off linearly with their magnitude when the frequency is plotted on a logarithmic scale. Despite this empirical relationship holding good for various datasets, the problem is the temporal nature of the relationship: it is one thing to say that there is a good chance of an earthquake in the coming 100 years and a completely different thing to say that it will hit between year X and year Y. Many scientists have tried working on this temporal problem, and a lot of them have called it quits. Why? It is governed by the same chaos-theory-style dependence on initial conditions. However, unlike weather prediction, where the science is well developed, the science of earthquakes is surprisingly thin. In the absence of science, one turns to probability and statistics for some indication of a forecast. The author takes the reader through a series of earthquake predictions that went wrong. Given the paucity of data and the problem of overfitting, many predictions have failed: scientists who predicted that gigantic earthquakes would occur at a particular place were wrong, and predictions that everything would be normal fell flat when earthquakes wreaked massive destruction. Basically, there has been a long history of false alarms.
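
The Gutenberg-Richter relation itself fits in a couple of lines. The a and b values below are illustrative only (b is typically close to 1, while a depends on the region); the relation gives long-run frequencies by magnitude, which is exactly why it says nothing about the timing question the chapter dwells on.

```python
def quakes_per_year(magnitude, a=5.0, b=1.0):
    """Gutenberg-Richter law: log10(N) = a - b*M, where N is the expected yearly
    number of quakes of at least magnitude M. a and b are illustrative values."""
    return 10 ** (a - b * magnitude)

for m in range(5, 10):
    n = quakes_per_year(m)
    note = f" (~1 every {1 / n:.0f} years)" if n < 1 else ""
    print(f"M >= {m}: about {n:g} per year{note}")
```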

How to Drown in Three Feet of Water (Economic Variable Prediction)

The chapter gives a brief history of US GDP prediction and makes it abundantly clear that it has been a big failure. Why do economic variable forecasts go bad?

  1. Hard to determine cause and effect
  2. Economy is forever changing
  3. Data is noisy

Besides the above reasons, policy decisions affect the economic variables at any point in time, so an economist has the twin job of forecasting the economic variable as well as the policy. Also, the sheer number of economic indicators that come out every year is huge. There is every chance that some of the indicators are correlated with the variable being predicted, and it might turn out that an economic variable is a lagging indicator in one period and a leading indicator in another. All this makes it difficult to cull out the signal; more often than not, the economist picks up on some noise and reports it.

In one way, an economist is dealing with a system that has characteristics similar to the one a meteorologist deals with. Both the weather and the economy are highly dynamic systems, and both are extremely sensitive to initial conditions. However, the meteorologist has had some success, mainly because there is rock-solid theory to help in making predictions; economics, on the other hand, is a soft science. Given this situation, it seems that predictions for economic variables are not going to improve much at all. The author suggests two alternatives:

  1. Create a market for accurate forecasts – prediction markets.
  2. Reduce the demand for inaccurate and overconfident forecasts – make reporting a margin of error compulsory for any forecast and see to it that there is a system that records forecast performance. To date, I have never seen a headline saying, “This year’s GDP forecast is between X% and Y%”. Most headlines are point estimates and they all carry an aura of absolutism. Maybe there is a tremendous demand for experts, but there isn’t actually that much demand for accurate forecasts.

Role Models (Epidemiological predictions)

This chapter gives a list of examples where flu predictions turned out to be false alarms. Complicated models are usually the target of people criticizing a forecast failure; in the case of flu prediction, though, it is the simple models that take a beating. The author explains that most of the models used in flu prediction are very simple models and that they fail miserably. Some examples are given of scientists trying to get a grip on flu prediction, mostly with agent-based simulation models. By the end of the chapter, however, the reader gets the feeling that flu prediction is not going to be easy at all. I had read about Google using search terms to predict flu trends, I think around 2008; lately I came across an article saying that Google’s flu trend prediction was not doing all that well. Of all the areas mentioned in the book, I guess flu prediction is the toughest, as it involves a multitude of factors, extremely sparse data and no clear understanding of how flu spreads.

Less and Less and Less Wrong

The main character of the story in this chapter is Bob Voulgaris, a basketball bettor. His story is a case in point of a Bayesian making money by placing bets in a calculated manner. There is no one BIG secret behind his success; instead, there are a thousand little secrets, and this repertoire keeps growing day after day, year after year. There are a ton of patterns everywhere in this information-rich world, but whether a pattern is signal or noise is becoming increasingly difficult to say. In the era of Big Data, we are deluged with false positives. There is a nice visual I came across that excellently summarizes the false positives of a statistical test; in one glance, it cautions us to be wary of false positives.

[image: false positives of a statistical test]

The chapter gives a basic introduction to Bayesian thinking using some extreme examples: what’s the probability that your partner is cheating on you? If a mammogram gives a positive result, what’s the probability that one has cancer? What’s the probability of a terrorist attack on the twin towers after the first attack? These examples merely reflect the wide range of areas where Bayes can be used. Even though Bayes’ theorem was brought to attention in 1763, major developments in the field did not take place for a very long time. One of the reasons was Fisher, who developed the frequentist way of statistics, which caught on. Fisher’s focus was on sampling error; in his framework there can be no error other than sampling error, and it shrinks as the sample size approaches the population size. I have read in some book that the main reason for the popularity of Fisher’s framework was that it laid out the exact steps a scientist needs to follow to get a statistically valid result. In one sense, he democratized the statistical testing framework: Fisher created various hypothesis testing procedures that could be used directly by many scientists. In the realm of limited samples and limited computing power, these methods thrived and probably did their job. But soon the frequentist framework started becoming a substitute for solid thinking about the context in which a hypothesis ought to be framed, and that is when people noticed that frequentist stats was becoming irrelevant. In fact, in the last decade or so, with massive computing power available, many have been advocating Bayesian stats for analysis; there is even a strong opinion in favor of replacing frequentist methodology entirely with the Bayesian paradigm in the statistics curriculum.
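
The mammogram example is the standard way of seeing why the base rate matters. The numbers below are illustrative ones commonly quoted for women in their forties (about 1.4% prevalence, roughly 75% sensitivity, roughly 10% false-positive rate); the figures in the book may differ slightly, but the structure of the calculation is the same.

```python
def posterior_prob(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Bayes' Rule for P(cancer | positive mammogram)."""
    p_positive = prior * p_pos_given_cancer + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_cancer / p_positive

# Illustrative numbers: 1.4% prevalence, 75% sensitivity, 10% false-positive rate.
print(round(posterior_prob(0.014, 0.75, 0.10), 3))   # ~0.096: still only ~10% after a positive test
```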

Rage against the Machines

This chapter deals with chess, a game where the initial conditions are known, the rules are known, and the pieces move according to deterministic constraints. Why does such a deterministic game appear in a book about forecasting? The reason is that, despite chess being deterministic, a game can proceed in any one of roughly 10^(10^50) ways, i.e. the number of possible branches to analyze is greater than the number of atoms in the world. Chess comprises three phases: the opening game, the middle game and the end game. Computers are extremely good in the end game, as there are few pieces on the board and the entire search path of the game can be analyzed quickly; in fact, all end games with six or fewer pieces have been solved. Computers also have an advantage in the middle game, where the complexity increases and the computer can search an enormously long sequence of possible moves. It is in the opening game that computers are considered relatively weak. The opening of a game is a little abstract; there might be multiple motives behind a move: a sacrifice to capture the center, a weak move to set up a stronger attack, and so on. Can a computer beat a human? This chapter gives a brief account of the way Deep Blue was programmed to beat Kasparov. It is fascinating to learn that Deep Blue was improved in much the same way a human improves at the game: the banal process of trial and error. The thought process behind coding Deep Blue was based on questions like:

  • Does allotting the program more time in the endgame and less in the midgame improve performance on balance?
  • Is there a better way to evaluate the value of a knight vis-à-vis a bishop in the early going?
  • How quickly should the program prune dead-looking branches on its search tree even if it knows there is some residual chance that a checkmate or a trap might be lurking there?

By tweaking these parameters and seeing how the program played with the changes, the team behind Deep Blue improved it slowly and eventually beat Kasparov. I guess the author is basically trying to say that even in such deterministic settings, trial and error and fox-like thinking are what made the machine powerful.

The Poker Bubble

This is an interesting chapter in which the author recounts his experiences playing poker, not merely as a small-time bystander but as a person who was making serious six-figure money in 2004 and 2005. So here is a person who is not giving some journalistic account of the game: he has actually played it, made money, and is talking about why he succeeded. The author introduces what he calls the prediction learning curve: if you do 20% of things right, you get your forecasts right 80% of the time. Doing this and making money in a game means there must be people who don’t get those 20% of things right. In a game like poker, you can make money only if there are enough suckers. Once the game becomes competitive and the suckers are out of it, the difference between an average player and an above-average player, in terms of their winnings, is not much. In the initial years of the poker bubble, everyone wanted to play poker and become rich quickly, which obviously meant there were enough suckers in the market. The author says he was able to make money precisely because of the bubble. Once the fish were out of the game, it became difficult for him to make money, and ultimately he had to give up and move on. The author’s message is:

It is much harder to be very good in fields where everyone else is getting the basics right—and you may be fooling yourself if you think you have much of an edge.

Think about the stock market. As the market matures, the same lot of mutual fund managers try to win the long-only game, and the same options traders try to make money off the market. Will they succeed? Yes, if there are enough fish in the market; no, if the game is played between near-equals. With equally qualified grads on the trading desks and the same colocated server infrastructure, can HFTs thrive? Maybe for a few years, but not beyond that, is the message from this chapter.

The author credits his success to picking his battles well. He went into creating software for measuring and forecasting baseball players’ performance in the pre-Moneyball era; he played poker during the boom, when getting 20% of things right could reap good money; and he went into election outcome forecasting when most election experts were not doing any quantitative analysis. In a way, this chapter is very instructive for people trying to decide on the fields where their prediction skills can be put to use. Having skills alone is not enough; it is important to pick the right fields in which to apply those skills.

If you can’t beat ‘em(Stock Market Forecasting)

The author gives an account of the prediction-market site Intrade and of the work of Justin Wolfers, a Wharton professor who has studied such markets. These markets are the closest thing to Bayes land: if you believe in certain odds and see someone else holding different odds for the same event, you enter into a bet and resolve the discrepancy. One might think that stock markets perform something similar, with investors who assign different odds to the same event settling their differences by entering into a financial transaction. However, the price is not always right in the market. The chapter gives a whirlwind tour of Fama’s efficient market theory, Robert Shiller’s work, Henry Blodget’s fraud case, etc. to suggest that the market might be efficient in the long run but that the short run is characterized by noise; only a few players benefit in the short run, and the composition of that pool changes from year to year. Can we apply Bayesian thinking to markets? Prediction markets come close to Bayes land, but real markets are very different: they have capital constraints, horizon constraints, and so on, so even though your view may be correct, the market can stay irrational for a long time. Applying Bayesian thinking to markets is therefore a little tricky. The author argues that the market is a two-way track: one lane is driven by fundamentals and pans out correctly in the long run; the other is a fast lane populated by HFT traders, algo traders, noise traders, bluffers and the like. According to the author, life in the fast lane is a high-risk game that not many can play and sustain over a period of time.

A climate of healthy Skepticism (Climate Prediction)

This chapter talks about climate models and the various uncertainties and issues pertinent to building such long-range forecasting models.

What you don’t know can hurt you (Terrorism Forecasting)

This chapter talks about terrorist attacks, military attacks, and the contribution a Bayesian approach can make. Post September 11, the commission report identified a “failure of imagination” as one of the biggest failures: national security agencies just did not imagine such a thing could happen and were essentially blind to devastation on that scale. Yes, there were a lot of signals, but all of them seem to make sense only after the fact. The chapter mentions Aaron Clauset, a professor at the University of Colorado, who compares predicting a terrorist attack to predicting an earthquake. One known tool in the earthquake prediction domain is the plot of frequency against intensity on a logarithmic scale; in the case of terrorist attacks, one can draw such a plot to at least acknowledge that an attack that might kill a million Americans is a possibility. Once that is acknowledged, such attacks fall into the known-unknown category, and at least a few steps can be taken by national security and other agencies to ward off the threat. There is also a mention of the Israeli approach to terrorism, where the government makes sure that people get back to their normal lives soon after a bomb attack, thus reducing the “fear” element that is one of the motives of a terrorist attack.

Takeaway:

The book is awesome in terms of the sheer breadth of its coverage. It gives more than a bird’s-eye view of forecast/prediction performance in the following areas:

  • Weather forecasts
  • Earthquake forecasts
  • Chess strategy forecasts
  • Baseball player performance forecasts
  • Stock market forecasts
  • Economic variable forecasts
  • Political outcome forecasts
  • Financial crisis forecasts
  • Epidemiological predictions
  • Baseball outcome predictions
  • Poker strategy prediction
  • Climate prediction
  • Terrorist attack prediction

The message from the author is abundantly clear to any reader at the end of the 500 pages. There is a difference between what we know and what we think we know. The strategy for closing the gap is Bayesian thinking. We live in an incomprehensibly large universe. The virtue of thinking probabilistically is that you force yourself to stop and smell the data, slow down, and consider the imperfections in your thinking. Over time, you should find that this makes your decision making better. “Have a prior, collect data, observe the world, update your prior and become a better fox as your work progresses” is the takeaway from the book.

image

"Elements of Statistical Learning" (ESL) is often referred to as the "bible" for anyone interested in building statistical models. The content in ESL is dense, and the implicit prerequisites are a good background in linear algebra and calculus and some exposure to statistical inference and prediction. Also, the visuals in ESL come with no accompanying code. If you don’t want to take them at face value and would like to check the statements / visuals in the book, you have to sweat it out.

This book has four co-authors, two of whom, Trevor Hastie and Robert Tibshirani, are co-authors of ESL. Clearly the intent of the book is to serve as a primer to ESL. The authors have selected a few models from each of the ESL topics and have presented them in such a way that anybody with an inclination to model can understand the contents of the book. The book is an amazing accomplishment as it does not require prerequisites such as linear algebra or calculus. Even knowledge of R is not really a prerequisite. However if you are not rusty with matrix algebra, calculus, probability theory, R etc., you can breeze through the book and get a fantastic overview of statistical learning.

I happened to go over this book after I had sweated my way through ESL. So, reading this book was like reiterating the main points of ESL. The models covered in ISL are some of the important ones from ESL. The biggest plus of this book is the generous visuals sprinkled throughout the book. There are 145 figures in a 440 page book, i.e. an average of one visual for every three pages. That in itself makes this book a very valuable resource. I will try to summarize some of the main points from ISL.

Introduction

Statistical Learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. The difference between the two is that in the former there is a response variable that guides the modeling effort, whereas in the latter there is none. The chapter begins with a visual exploration of three datasets: the first is a wage dataset used to explore the relationship between wage and various factors, the second is a stock market returns dataset used to explore the relationship between up and down days and various return lags, and the third is a gene expression dataset that serves as an example of unsupervised learning. The authors use these to survey the gamut of statistical modeling techniques available to a modeler. There are 15 datasets used throughout the book, most of them available in the companion ISLR package on CRAN.
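
If you want to follow along in R, most of the datasets can be loaded straight from the companion package. A minimal sketch, assuming the ISLR package has been installed from CRAN (the column picks below are just for a quick peek):

  # Load the companion package and glance at two of the datasets used in the book
  library(ISLR)      # install.packages("ISLR") if it is not already installed
  data(Wage)         # wages vs age, education, year, ...
  data(Smarket)      # S&P 500 daily movements with lagged returns
  str(Wage[, c("age", "education", "wage")])
  str(Smarket[, c("Lag1", "Lag2", "Direction")])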

Some historical background for Statistical Learning

  • 19th century – Legendre and Gauss method of least squares
  • 1936 – Linear Discriminant Analysis
  • 1940 – Logistic Regression
  • 1970 – Generalized Linear Model – Nelder and Wedderburn
  • 1980s – By the 1980s computing technology had improved; Breiman, Friedman, Olshen and Stone introduced classification and regression trees, and the principles of cross-validation were laid out.
  • 1986 – Hastie and Tibshirani coined the term generalized additive models for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation.
  • In recent years, the growth of R has made these techniques available to a wide audience.

Statistical Learning

While building a model, it is helpful to keep in mind the purpose of the model. Is it for inference? Is it for prediction? Is it for both? If the purpose is prediction, the exact functional form is not of interest as long as the predictions are good. If the model building is targeted towards inference, then the exact functional form is of interest because it answers questions such as:

  • What variables amongst the vast array of variables are important?
  • What is the magnitude of this dependence?
  • Is the relationship a simple linear relationship or a complicated one?

In the case of a model for inference, a simple linear model would be great as things are interpretable. In the case of prediction, non-linear methods are fine too as long as they predict well. Suppose that there is a quantitative response Y and there are p different predictors X_1, X_2, ..., X_p. One can posit a model relating Y to the predictors, of the form

Y = f(X) + ε

where f represents the systematic information that X provides about Y and ε is a random error term.

The estimate of f is denoted f̂. The error in the resulting prediction has two parts, the reducible error and the irreducible error:

E(Y − Ŷ)² = [f(X) − f̂(X)]² + Var(ε)

where the first term is the reducible error (it can be shrunk by choosing a better f̂) and Var(ε) is the irreducible error. For a given test point x0, the expected test MSE can be broken down further into

E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)

So, whatever method you choose for modeling, there will be a seesaw between bias and variance. If you are interested in inference from a model, then linear regression, lasso, subset selection have medium to high interpretability. If you are interested in prediction from a model, then methods like GAM, Bagging, Boosting, SVM approaches can be used. The authors give a good visual that summarizes this tradeoff between interpretability and prediction for various methodologies.

The following visual summarizes the position of various methods on bias variance axis.

 

clip_image009

When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. As in the case of an algo trading model, it does not matter what the exact form of the model is as long as it predicts the moves correctly. There is a third variation of learning besides supervised and unsupervised learning: semi-supervised learning, where you have the response variable for a subset of the data but not for the rest.

The standard way to measure model fit is via the mean squared error

MSE = (1/n) Σ_i (y_i − f̂(x_i))²

This error is typically smaller on the training dataset than on new data. One needs to compute the statistic on test data and choose the model that gives the least test MSE. There is no guarantee that the method with the least error on the training data will also have the least error on the test data.

I think the following statement is the highlight of this chapter:

Irrespective of the method used, as the flexibility of the method increases, the training error decreases monotonically while the test error traces a U-shaped curve.

clip_image012

Why do we see a U-shaped curve for the test MSE? As more flexible methods are explored, variance increases and bias goes down. Initially, the bias decreases faster than the variance increases, so the test MSE falls. But beyond a point the bias hardly changes while the variance keeps growing, and the test MSE rises again. This is the reason for the U curve.
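
The U shape is easy to reproduce with a small simulation. Here is a minimal sketch of my own (not from the book), where flexibility is the degree of a polynomial fit:

  # Simulate a non-linear truth; training MSE falls with degree, test MSE is U-shaped
  set.seed(1)
  n <- 100
  x_train <- runif(n, -2, 2); y_train <- sin(2 * x_train) + rnorm(n, sd = 0.3)
  x_test  <- runif(n, -2, 2); y_test  <- sin(2 * x_test)  + rnorm(n, sd = 0.3)
  mse <- t(sapply(1:12, function(d) {
    fit <- lm(y_train ~ poly(x_train, d))
    c(degree = d,
      train  = mean((y_train - fitted(fit))^2),
      test   = mean((y_test - predict(fit, data.frame(x_train = x_test)))^2))
  }))
  mse    # training error decreases monotonically; test error bottoms out and rises again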

In the classification setting, the gold standard for evaluation is the conditional probability

Pr(Y = j | X = x0)

for the various classes j. The classifier that assigns each observation to its most likely class given its predictor values is called the Bayes classifier, and its prediction is determined by the Bayes decision boundary. The Bayes classifier provides the lowest possible test error rate. Since it always chooses the class for which the conditional probability is largest, the error rate takes a convenient form. The Bayes error rate is analogous to the irreducible error; this connection between the Bayes error rate and the irreducible error is something I had never paid attention to. For real data one does not know the conditional probability, hence computing the Bayes classifier is impossible; it stands as the gold standard against which to compare methods on simulated data. If you think about K-nearest-neighbor methods, as the number of neighbors increases, the bias goes up and the variance goes down. For a neighborhood of size 1, there is very little bias but a lot of variance. Thus if you are looking for a parameter to characterize increasing flexibility, 1/K is a good choice. Typically you see a U-shaped curve for the test error and can then zoom in on the K value that yields the minimum.
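
Since 1/K acts as the flexibility knob for KNN, the same U-shaped curve can be seen with a few lines of R. A sketch using the class package on simulated two-class data (my own toy example):

  # Test error of KNN for a range of K values on a simulated classification problem
  library(class)                        # provides knn()
  set.seed(2)
  n <- 200
  x <- matrix(rnorm(2 * n), ncol = 2)
  y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.7) > 0, "Up", "Down"))
  train <- sample(n, n / 2)
  sapply(c(1, 3, 5, 10, 25, 50), function(k) {
    pred <- knn(x[train, ], x[-train, ], y[train], k = k)
    mean(pred != y[-train])             # test error rate for this K
  })
  # very small and very large K both tend to do worse than something in between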

Linear Regression

This chapter begins with a discussion about the poster boy of statistics, linear regression. Why go over linear regression? The chapter puts it aptly,

Linear regression serves as a good jumping-off point for newer approaches.

This chapter teaches one to build linear models with hardly any mention of linear algebra. The right amount of intuition and the right amount of R code drive home the lesson. A case in point is the way it illustrates spurious relationships via this example:

Assume that you are investigating shark attacks. You have two variables, ice cream sales and temperature. If you regress shark attacks on ice cream sales, you might see a statistical relationship. If you regress shark attacks on temperature, you might see a statistical relationship. But if you regress shark attacks on both ice cream sales and temperature, you might see the ice cream sales variable being knocked off. Why is that?

Let’s say there are two correlated predictors and one response variable. If you regress the response variable on each predictor individually, each regression might show a statistically significant relationship. However, if you include both predictors in a multiple regression model, you might see that only one of the variables is significant. In a multiple regression, some of the correlated variables get knocked off for the simple reason that the effect of each predictor is measured keeping all else constant. If you want to estimate the coefficient of ice cream sales, you keep temperature constant and check the relationship between shark attacks and ice cream sales; you will immediately see that there is no relationship whatsoever. This is a nice example (for classroom purposes) that drives home the point that one must be careful about correlation between predictors.
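
The shark-attack story is easy to mimic with simulated data. A hypothetical sketch (the numbers are made up purely for illustration) in which temperature drives both ice cream sales and attacks:

  # Ice cream looks significant on its own, but gets knocked off once temperature is included
  set.seed(3)
  temperature <- runif(200, 10, 35)
  ice_cream   <- 5 + 2 * temperature + rnorm(200, sd = 5)
  attacks     <- 1 + 0.5 * temperature + rnorm(200, sd = 2)
  summary(lm(attacks ~ ice_cream))                  # spuriously significant
  summary(lm(attacks ~ ice_cream + temperature))    # ice_cream is no longer significant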

Complete orthogonality amongst predictors is an illusion; in reality there will always be some correlation between predictors. What are the assumptions behind a linear model? Any simple linear model has two assumptions, one being additivity, the other being linearity. Under the additivity assumption, the effect of changes in a predictor X_j on the response Y is independent of the values of the other predictors. Under the linearity assumption, the change in the response Y due to a one-unit change in X_j is constant, regardless of the value of X_j.

Why are prediction intervals wider than confidence intervals? This is something that is not always explained well in words, and the chapter puts it nicely. Given a model, you will be able to predict the response only up to the irreducible error; even if you knew the true population parameters, the model says there is an irreducible error you cannot escape. For a given set of covariates, one can form two kinds of bands. Confidence intervals quantify the uncertainty around the average response, i.e. around the fitted function. Prediction intervals are wider because they also include the irreducible error. A confidence interval tells you what happens on average, whereas a prediction interval tells you what happens for a specific data point.

Basic inference related questions

  • Is at least one of the predictors useful in predicting the response? The classic way to test this is to look at the overall F-statistic (for example via an ANOVA table). The summary table of a linear regression reports p-values for the partialled-out effect of each predictor. With a handful of predictors it is ok to look at these p-values, but with a large number of predictors some coefficients will appear statistically significant by sheer randomness; in such cases one needs to look at the F-statistic (a small sketch follows this list).
  • Do all the predictors help explain Y? Forward selection, backward selection or hybrid methods can be used.
  • How well does the model fit the data? Mallow's Cp, AIC, BIC and adjusted R² are all metrics that give an estimate of the test error based on the training error.
  • Given a set of predictors, what response values should we predict and how accurate is our prediction?
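
Here is the promised sketch, a toy illustration of why the overall F-statistic matters when there are many predictors (simulated data of my own, not the book's):

  # Response is pure noise: a few t-test p-values dip below 0.05 by chance,
  # but the overall F-statistic stays unimpressive
  set.seed(4)
  n <- 100; p <- 20
  X <- matrix(rnorm(n * p), ncol = p)
  y <- rnorm(n)                                     # unrelated to every predictor
  fit <- lm(y ~ X)
  summary(fit)$fstatistic                           # overall F-statistic and its degrees of freedom
  sum(summary(fit)$coefficients[-1, 4] < 0.05)      # count of "significant" coefficients by luck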

Clearly there are a ton of aspects that go wrong when you fit real life data, some of them are :

  • Non-linearity of the response-predictor relationship – residuals vs fitted plots can give an indication of this; transforming the predictors can help.
  • Correlation of error terms – in this case the error covariance matrix has structure and one can use generalized least squares to fit the model. Successive error terms can be correlated; one has to estimate the correlation structure, check whether the errors truly follow it, and iterate. Either you can code this up or conveniently use the nlme package.
  • Non-constant variance of error terms – if you see a funnel-shaped pattern in the residuals, a log transformation of the response variable can help.
  • Outliers – they bloat the RSE and hence make all confidence-interval interpretations questionable; plot studentized residuals to spot them.
  • High leverage points – outliers are unusual response values, whereas leverage points are unusual covariate values. There are standard diagnostic plots to identify these miscreants.
  • Collinearity – because of collinearity, very different coefficient estimates can produce a similar RSS; the likelihood surface is a narrow valley and the parameter estimates are unstable. This basically has to do with the design matrix not having full column rank: X'X may be nearly singular, so some of its eigenvalues are close to 0. The best way to detect collinearity is to compute variance inflation factors; the faraway package has a vif function that returns the VIF for every predictor. Nothing much can be done to reduce collinearity once the data has been collected; all one can do is drop a collinear column, or replace a group of collinear columns with a single linear combination.

The chapter ends with a section on KNN regression, which is compared with linear regression to contrast non-parametric and parametric modeling approaches. KNN regression works well if the true model is non-linear, but as the number of dimensions increases the method falters. Using the bias-variance trade-off graphs for training and test data given in the chapter, one can see that KNN performs badly if the true relationship is linear and K is small; if K is very large and the true model is linear, KNN works about as well as a parametric model.

The highlight of this chapter, as of the whole book, is the readily available visuals given to the reader. All you have to do is study them. If you start with ESL instead of this book, you have to not only know the ingredients but also code everything up and check all the statements for yourself. This book makes life easy for the reader.

Classification

I sweated through the chapter on classification in ESL. Maybe that’s why I found this chapter on classification a welcome recap of all the important points and, in fact, a pleasant read. The chapter starts off by illustrating the problem with using linear regression for a categorical response, and then shows that a variation of the linear model, the logit model, handles the two-class case. However if the categorical variable has more than 2 levels, the performance of the multinomial logit starts to degrade.

The chapter gives a great example of a paradox. The dataset used to illustrate it contains default status, credit card balance, income and a categorical student variable. If you regress default against the student variable alone, you find the variable significant. If you regress default on both balance and the student variable, the student variable no longer tells the same story. This is similar to the shark attacks / temperature / ice-cream sales case: there is an association between balance and default, and the fact that student shows up as statistically significant on its own is due to the student variable being correlated with balance. Here is a nice visual to illustrate confounding.

image

The visual on the left shows that at any given balance level, the default rate of students is less than that of non-students. However, the overall average default rate of students is higher than that of non-students. This is mainly because of the confounding variable: as the visual on the right shows, students typically carry much larger balances, and balance is the variable driving default. The conclusion is: on average, students are riskier than non-students if you have no other information, but at a given balance, students actually default less than non-students. I am not sure, but I think this goes by the name of “Simpson’s paradox”.

In cases where the categorical response has more than two levels, it is better to go for a classification technique such as discriminant analysis. What’s the basic idea? You model the distribution of the predictors X separately in each of the response classes and then use Bayes’ theorem to flip these around into estimates of the posterior probability that the response belongs to a specific class. LDA and QDA are two popular techniques of this kind. LDA assumes a common covariance structure for the predictors across the levels of the response variable, and uses Bayes’ theorem to come up with a linear discriminant function to bucketize each observation. Why prefer LDA to logistic regression in certain cases?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. LDA does not suffer from this problem
  • If the sample size is small and the distribution of the predictors is approximately normal in each of the classes, LDA is again more stable than the logistic regression model
  • LDA is popular when there are more than two response classes

QDA assumes a separate covariance matrix for each class and is useful when there is a lot of data, so that the variance of the classifier is not a great concern. LDA assumes a common covariance matrix, which is useful when the dataset is small and there is a price to pay for estimating a separate covariance matrix for each class. If the sample size is small, LDA is a better bet than QDA; QDA is recommended when the training set is large, so that the variance of the classifier is not a major concern.

There are six examples that illustrate the following points

  • When the true decision boundaries are linear, then LDA and logistic regression approaches will tend to perform well
  • When boundaries are moderately non-linear, QDA may give better results
  • For much more complicated boundaries, a non parametric approach such as KNN can be superior. But the level of smoothness for non parametric method needs to be chosen carefully.
  • Ideally one can think of QDA as the middle ground between LDA and KNN type methods

The lab at the end of the chapter gives the reader R code and commentary to fit logit, LDA, QDA and KNN models. As ISL is a stepping stone to ESL, it is not surprising that many other aspects of classification are not mentioned; a curious reader can find regularized discriminant analysis, the connection between QDA and Fisher’s discriminant analysis, and many more topics in ESL. However in the case of ESL there is no spoon-feeding; you have to sweat it out to understand. ESL also mentions an interesting finding: given that LDA and QDA make such harsh assumptions about the nature of the data, it is surprising that they are two of the most popular methods among practitioners.
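
For a flavour of what such a lab session looks like, here is a hedged sketch using the MASS package on simulated two-class data (the data and variable names are my own, not the book's):

  # Logistic regression, LDA and QDA fit to the same simulated data
  library(MASS)                                   # lda() and qda()
  set.seed(5)
  n <- 300
  x1 <- rnorm(n); x2 <- rnorm(n)
  cls <- factor(ifelse(x1 - x2 + rnorm(n) > 0, "A", "B"))
  dat <- data.frame(x1, x2, cls)
  logit_fit <- glm(cls ~ x1 + x2, data = dat, family = binomial)
  lda_fit   <- lda(cls ~ x1 + x2, data = dat)
  qda_fit   <- qda(cls ~ x1 + x2, data = dat)
  mean(predict(lda_fit)$class != dat$cls)         # training error rate of LDA, for instance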

Resampling Methods

This chapter talks about resampling methods, the technique that got me hooked on statistics. Resampling is an indispensable tool in modern statistics. The chapter covers the two most commonly used resampling techniques: cross-validation and the bootstrap.

Why cross-validation ?

  • Model Assessment : It is used to estimate the test error associated with a given statistical learning method in order to evaluate its performance.
  • Model Selection : Used to select the proper level of flexibility. Let’s say you are using KNN method and you need to select the value of K. You can do a cross validation procedure to plot out the test error for various values of K and then choose the K that has the minimum test error rate.

The chapter discusses three methods: the validation set approach, LOOCV (leave-one-out cross-validation), and k-fold cross-validation. In the first approach, a part of the dataset is set aside for testing, and the flexibility parameter is chosen so that the validation MSE is minimized.

What’s the problem with Validation set approach?

  • There are two problems with validation set approach. The test error estimate has a high variability as it depends on how you choose your training and test data. Also in the validation set approach, only part of the data is used to train the model. Since the statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate.
  • LOOCV has some major advantages. It has less bias, it uses almost the entire dataset, and it does not overestimate the test error rate. With least squares or polynomial regression, LOOCV has an amazing shortcut formula that cuts down the computational time; it is a direct function of the hat values and residuals (see the sketch after this list). But there is no ready-made formula for other models, so LOOCV can be computationally expensive. As an aside, if you use LOOCV to choose the bandwidth in a kernel density estimation procedure, all the hard work the FFT does is washed away; the procedure is no longer O(n log n).
  • k-fold CV is far better from a computational standpoint. It has some bias compared to LOOCV, but a lower variance than the LOOCV-based estimate. So, in the trade-off between less bias / more variance and more bias / less variance, 5-fold or 10-fold CV has been shown to yield test error rates that do not suffer from excessively high bias or high variance.
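
The LOOCV shortcut for least squares is just a function of the hat values; a minimal sketch that checks it against the brute-force version:

  # LOOCV for lm() without refitting n times: mean of (residual / (1 - hat value))^2
  set.seed(6)
  x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
  fit <- lm(y ~ x)
  loocv_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)
  loocv_brute <- mean(sapply(seq_along(y), function(i) {
    f <- lm(y ~ x, subset = -i)                   # refit leaving observation i out
    (y[i] - predict(f, data.frame(x = x[i])))^2
  }))
  c(shortcut = loocv_shortcut, brute = loocv_brute)   # the two agree up to rounding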

Most of the time we do not have the true error rate, nor do we have the Bayes decision boundary; we are testing out models with various flexibility parameters. One way to choose the flexibility parameter is to estimate the test error rate using k-fold CV and pick the value with the lowest estimate. As I have mentioned earlier, the visuals are the highlight of this book.

The following visual shows the variance of the test error estimate via the validation-set method.

image

The following visual shows the variance of the test error estimate via k-fold CV. The variance of the estimate is definitely less than that of the validation-set method.

image

The following visual shows that selecting a flexibility parameter based on LOOCV or k-fold CV yields almost the same result.

image

The chapter ends with an introduction to the bootstrap, a method that samples data from the original dataset with replacement and repeatedly computes the desired parameter. This yields a sample of parameter estimates from which the standard error of the parameter can be computed.

I learnt an important aspect of bootstrapping: the probability that a specific data point is not picked up in a bootstrap sample is about 1/e. This means that even though you are bootstrapping away to glory, roughly a third of the observations are never touched. It also means that when you build a model on one bootstrapped sample, you can test that model on the roughly one-third of observations that were not picked up. In fact this is exactly the idea used in bagging, an ensemble method built on regression trees. You can manually write the code for cross-validation, be it LOOCV or k-fold CV, but R gives it all for free. The lab at the end of the chapter covers the boot package, which has functions that let you do cross-validation directly. In fact, after you have worked with a few R packages, you will realize that many modeling functions have a cross-validation argument that lets the model pick the flexibility parameter via k-fold cross-validation.
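
A minimal sketch of the boot package functions mentioned above, on simulated data (my own example, not the book's lab):

  # 10-fold CV with cv.glm() and a bootstrap standard error with boot()
  library(boot)
  set.seed(7)
  dat <- data.frame(x = rnorm(100))
  dat$y <- 1 + 2 * dat$x + rnorm(100)
  glm_fit <- glm(y ~ x, data = dat)              # gaussian glm, i.e. plain least squares
  cv.glm(dat, glm_fit, K = 10)$delta[1]          # 10-fold CV estimate of test MSE
  slope_fn <- function(d, idx) coef(lm(y ~ x, data = d[idx, ]))[2]
  boot(dat, slope_fn, R = 1000)                  # bootstrap SE of the slope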

After going through the R code from the lab, I learnt that it is better to use the subset argument when building a model on a subset of data. I had been following the painful process of actually splitting the data into test and training samples. Instead, one can pass in the indices of the observations meant for training, and the model will be built on those observations only. Neat!

Linear Model Selection and Regularization

Alternative fitting procedures to least squares can yield better prediction accuracy and model interpretability. If p > n there is no unique least squares coefficient estimate. However, by shrinking the estimated coefficients, one can substantially reduce their variance at the cost of a negligible increase in bias. Also, least squares fits all the predictors; there are methods that help with feature or variable selection.

There are three alternatives to least squares that are discussed in the book.

  • Subset Selection: identify a subset of the p predictors that we believe to be related to the response
  • Shrinkage: regularization involves shrinking the coefficients and, depending on the method, some of the shrunken coefficients can be exactly 0, making it a feature selection method
  • Dimension Reduction: project the p predictors onto an M-dimensional space with M < p

In best subset selection, you choose amongst 2^p models. Since this is computationally expensive, one can use forward / backward or hybrid methods; the number of models fitted in a forward or backward stepwise search is 1 + p(p+1)/2. How does one choose the optimal model? There are two ways to think about this. Since all the models are built on one dataset and that is all we have, we can adjust the training error rate, which is usually a poor estimate of the test error rate; this gives methods like Mallow's Cp, AIC, BIC and adjusted R². All these results can be obtained via the regsubsets function from the leaps package.
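
A quick sketch of best subset selection with regsubsets, on simulated predictors of my own:

  # Best subset selection, then pick the model size preferred by BIC
  library(leaps)
  set.seed(8)
  X <- matrix(rnorm(100 * 8), ncol = 8, dimnames = list(NULL, paste0("x", 1:8)))
  y <- 2 * X[, 1] - 3 * X[, 2] + rnorm(100)
  dat <- data.frame(y, X)
  best <- regsubsets(y ~ ., data = dat, nvmax = 8)
  summ <- summary(best)
  which.min(summ$bic)                     # model size with the lowest BIC
  coef(best, which.min(summ$bic))         # coefficients of that model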

The other route is via validation and cross-validation. The validation method involves randomly splitting the data, training on one part and testing the trained model on the other; the model with the least test error is chosen. K-fold cross-validation means splitting the data into k segments, training on k−1 segments, testing on the held-out segment, repeating this k times and averaging the errors. Both methods give a direct estimate of the test error and do not make distributional assumptions, hence they are preferred over methods that merely adjust the training error. An alternative to best subset selection is a class of methods that penalizes the values the coefficients can take; the right term is "regularizes the coefficient estimates". Two such methods are discussed in the text, ridge regression and the lasso. The penalty is an ℓ2 penalty for ridge regression and an ℓ1 penalty for the lasso.

The math behind the three types of regression can be succinctly written as

Ridge regression

minimize  Σ_i (y_i − β₀ − Σ_j β_j x_ij)²  +  λ Σ_j β_j²

Lasso regression

minimize  Σ_i (y_i − β₀ − Σ_j β_j x_ij)²  +  λ Σ_j |β_j|

Best subset regression

minimize  Σ_i (y_i − β₀ − Σ_j β_j x_ij)²  subject to  Σ_j I(β_j ≠ 0) ≤ s

Some basic points to keep in mind

  • One should always standardize the data. The least squares estimates are scale invariant, but ridge regression and lasso estimates depend on λ and on the scaling of the predictors.
  • Ridge regression works best where the least squares estimate have high variance. In Ridge regression none of the coefficients have exactly 0 value. In lasso, the coefficients are taken all the way to 0. Lasso is more interpretable than Ridge regression.
  • Neither Ridge nor Lasso universally dominate the other. It all depends on the true model. If there are small number of predictors and the rest of predictors have small coefficients lasso performs better than ridge. If the modeling function is dependent on many predictors, then ridge performs better than lasso.
  • The best way to remember lasso and ridge is that lasso shrinks the coefficients by the same amount whereas ridge shrinks the coefficients by the same proportion. In either case the best way to choose the regularization parameter is via cross validation.
  • There is a Bayesian angle to the whole discussion of lasso and ridge: from a Bayesian point of view, the lasso corresponds to assuming a Laplace prior for the betas, whereas ridge corresponds to a Gaussian prior. A small glmnet sketch follows this list.
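
Here is the promised glmnet sketch, fitting ridge and lasso with the regularization parameter chosen by cross-validation (simulated data, my own example):

  # Ridge (alpha = 0) and lasso (alpha = 1); glmnet standardizes predictors by default
  library(glmnet)
  set.seed(9)
  X <- matrix(rnorm(100 * 20), ncol = 20)
  y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(100)
  ridge_cv <- cv.glmnet(X, y, alpha = 0)
  lasso_cv <- cv.glmnet(X, y, alpha = 1)
  coef(ridge_cv, s = "lambda.min")   # everything shrunk, nothing exactly zero
  coef(lasso_cv, s = "lambda.min")   # many coefficients set exactly to zero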

The other methods discussed in the chapter fall under dimension reduction. The first is principal components regression. In this method, you create linear combinations of the predictors in such a way that each successive component captures the maximum remaining variance in the data. All dimension reduction methods work in two steps: first, the transformed predictors Z₁, …, Z_M are obtained; second, the model is fit using these M predictors. There are two ways of interpreting PCA directions: one is as the direction along which the data shows the maximum variance, the other is as the direction to which the data points are closest.

image

 

What does this visual say? By choosing the right direction, one obtains a derived predictor with maximum variance along that direction, and the corresponding coefficient is more stable: in simple regression the standard error of β̂ is proportional to 1/√(Σ_i (x_i − x̄)²), so the larger the spread of the predictor, the smaller the standard error. When the predictor values are bunched together, the standard errors are bloated; by choosing directions along the principal components, the standard error of β̂ stays low.

Observe these visuals

image

image

The first component captures most of the information present in the two variables; the second principal component has little to contribute. By regressing on these principal components, the standard errors of the betas are smaller, i.e. the coefficients are stable. Despite having gone through partial least squares regression in previous texts, the following visual made it all come together for me.

image

This shows that PLS-based regression chooses a slightly tilted direction compared to PCR, the reason being that the component directions are chosen in such a way that the response variable is also taken into consideration. The best way to verbalize PLS is that it is a "supervised alternative" to PCR. The chapter gives the basic algorithm for performing PLS.
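
A sketch of PCR and PLS using the pls package, again on simulated data of my own:

  # Principal components regression vs partial least squares, both tuned by CV
  library(pls)
  set.seed(10)
  X <- matrix(rnorm(100 * 10), ncol = 10)
  y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)
  dat <- data.frame(y, X)
  pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
  pls_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
  validationplot(pcr_fit, val.type = "MSEP")   # choose the number of components by CV error
  summary(pls_fit)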

Towards the end of the chapter, the book talks about high-dimensional data, where p > n. Datasets containing more features than observations are often referred to as high-dimensional. In such a dataset the model might overfit: imagine fitting a linear regression model with 2 predictors and 2 observations. It will be a perfect fit, the residuals will be 0 and the R² will be 100%. When p is comparable to or larger than n, a simple least squares regression line is too flexible and overfits the data.

Your training data might suggest using all the predictors, but all that says is that you have fit one of a whole lot of models that could describe the data. How does one deal with high-dimensional data? Forward stepwise selection, ridge regression, the lasso and principal components regression are particularly useful for regression in the high-dimensional setting. Even when using these methods, three things need to be kept in mind:

  • Regularization or shrinkage plays a key role in high-dimensional problems
  • Appropriate tuning parameters selection is crucial for good predictive performance
  • The test error tends to increase as the dimensionality of the problem increases, unless the additional features are truly associated with the response

Another important point concerns reporting the results of a high-dimensional data analysis. Reporting the training error is a no-no, as it does not paint the right picture; it is better to report the cross-validation error or the results on an independent dataset.

Moving Beyond Linearity

In all the models considered in the previous chapters, the functional form has been linear. One can think of extending the linear model; here are a few variants:

  • Polynomial regression : This extends the linear model by adding powers of predictors
  • Step functions cuts the range of variable in to K distinct regions in order to produce a qualitative variable
  • Regression splines are more flexible than the above two. They involve dividing the range of X into K distinct regions and fitting a polynomial within each region. The polynomials are constrained so that they join smoothly at the region boundaries, called knots. Splines are popular because the human eye cannot detect the discontinuities at the knots.

The following gives a good idea about regression splines

image

In the top left, unconstrained cubic polynomials are fit in each region. In the top right, a constraint of continuity is imposed at the knot. In the bottom left, continuity of the function and of its first and second derivatives is imposed. A cubic spline with K knots uses a total of K + 4 degrees of freedom. A degree-d spline is one that is a piecewise degree-d polynomial, with continuity in derivatives up to degree d−1 at each knot. Fitting a regression spline can be made easy by adding one truncated power basis function per knot.

To cut the variance of splines at the edges, one can impose additional constraints at the boundaries (the fit is required to be linear beyond the boundary knots), which results in natural splines.
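
A sketch with the splines package (which ships with base R), fitting a cubic spline and a natural spline inside lm():

  # Regression spline with three interior knots, and a natural spline with 4 df
  library(splines)
  set.seed(11)
  x <- runif(200, 0, 10)
  y <- sin(x) + rnorm(200, sd = 0.3)
  fit_bs <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5)))   # cubic spline
  fit_ns <- lm(y ~ ns(x, df = 4))                   # natural spline: linear beyond boundary knots
  ord <- order(x)
  plot(x, y, col = "grey")
  lines(x[ord], fitted(fit_bs)[ord], lwd = 2)
  lines(x[ord], fitted(fit_ns)[ord], lwd = 2, lty = 2)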

What’s the flip side of local regression? Local regression can perform poorly if the number of predictors is more than 3 or 4. This is the same curse of dimensionality that KNN suffers from.

The chapter ends with GAM (Generalized Additive Models) that provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.

image

There are a ton of functions mentioned in the lab section that equip a reader to estimate polynomial regression models, cubic splines, natural splines, smoothing splines, local regressions etc.
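
A minimal GAM sketch, here written with the mgcv package (one of several R implementations of additive models; the data is simulated, not the book's):

  # Additive model: smooth functions of two predictors plus a factor term
  library(mgcv)
  set.seed(12)
  n <- 300
  x1 <- runif(n); x2 <- runif(n); grp <- factor(sample(c("a", "b"), n, TRUE))
  y <- sin(2 * pi * x1) + (x2 - 0.5)^2 + 0.5 * (grp == "b") + rnorm(n, sd = 0.2)
  gam_fit <- gam(y ~ s(x1) + s(x2) + grp)
  summary(gam_fit)
  plot(gam_fit, pages = 1)    # fitted smooth for each term on one page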

Tree Based Methods

What’s the basic idea? Trees involve stratifying or segmenting the predictor space into a number of simple regions. To make a prediction, one typically uses the mean or the mode of the training observations in the region to which the new observation belongs. Tree-based methods are simple and useful for interpretation, but a single tree is not competitive with the best supervised learning approaches. A better alternative is to grow multiple trees and combine their predictions.

The chapter starts with decision trees, which can be applied to both regression and classification tasks. In a regression task the response variable is continuous, whereas in classification it is discrete. In the case of regression trees, the predictor space is split into J distinct and non-overlapping regions R₁, …, R_J. The goal is to find boxes that minimize

Σ_j Σ_{i ∈ R_j} (y_i − ŷ_{R_j})²

where ŷ_{R_j} is the mean response within the jth box. To make this computationally practical, a recursive binary splitting method is followed: a top-down greedy algorithm that, at each step, makes the best split at that particular step rather than looking ahead and picking a split that would lead to a better tree later. The problem with this approach is that the fitted tree can be too complex and hence perform poorly on test data. One way around this is to grow the tree and then prune it, which is called cost complexity pruning; the way this is done is similar to lasso regression.

A classification tree is similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. Since one is interested not only in the class prediction for a particular terminal node but also in the class proportions among the training observations that fall into that region, RSS cannot be used. The natural candidate is the classification error rate: the fraction of training observations in a region that do not belong to the most common class. In practice two other metrics are used for splitting purposes (measures of node purity):

  • Gini Index
  • Cross-entropy index

The key idea from this section is this :

One can fit a linear regression model or a regression tree. The relative performance of one over the other depends on the underlying true model structure. If the true model is indeed linear, then linear regression is much better. But if the predictor space is non linear, it is better to go with regression tree. Even though trees seem convenient for a lot of reasons like communication, display purposes etc., the predictive capability of trees is not all that great as compared to regression methods.

The chapter talks about bagging, random forests and boosting, each of which is dealt with at length in ESL. Sometimes I wonder whether it is better to understand a topic in depth before getting an overview of it, or the other way round. The problem with decision trees is that they have high variance: if the training and test data are split differently, a different tree is built. To cut down the variance, bootstrapping is used: you bootstrap the data, grow a tree on each bootstrap sample, and aggregate across all the bootstrapped trees.

There is a nice little fact mentioned in one of the previous chapters: the probability that a given data point is not selected in a bootstrap sample approaches 1/e. Hence for every bootstrap sample, roughly a third of the training points are not selected for growing the tree. Once the tree is grown on that bootstrapped sample, this out-of-bag sample can be used for testing, and over many bootstraps you can form an error estimate for every observation. A tweak on bagging that decreases the correlation between the bootstrapped trees is the random forest: at each split you consider only a random subset of the predictors, build the tree, and then average over many such bootstrap samples. The number of predictors considered at each split is typically the square root of the total number of predictors. The chapter also explains boosting, which is similar in spirit but learns slowly.
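
A sketch with the randomForest package, which covers both bagging (use all predictors at each split) and random forests (use a random subset); simulated data of my own:

  # Bagging vs random forest; the out-of-bag error comes for free
  library(randomForest)
  set.seed(13)
  n <- 300; p <- 10
  X <- matrix(rnorm(n * p), ncol = p, dimnames = list(NULL, paste0("x", 1:p)))
  y <- factor(ifelse(X[, 1] + X[, 2]^2 + rnorm(n) > 1, "yes", "no"))
  dat <- data.frame(y, X)
  bag_fit <- randomForest(y ~ ., data = dat, mtry = p, ntree = 500)               # bagging
  rf_fit  <- randomForest(y ~ ., data = dat, mtry = floor(sqrt(p)), ntree = 500)  # random forest
  bag_fit$err.rate[500, "OOB"]    # OOB error estimates after 500 trees
  rf_fit$err.rate[500, "OOB"]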

Unsupervised Learning

This chapter talks about PCA and clustering methods. At this point in the book, i.e. towards the end, I felt that it is better to sweat it out with ESL and subsequently go over this book; the more one sweats it out with ESL, the more ISL turns out to be a pleasant read. Maybe I am biased here, and maybe the right sequence is ISL first and then ESL. Whatever the sequence, I think this book will serve as an apt companion to ESL for years to come.

 

imageimage

 

Right from the preface of the book, Prof. David Williams emphasizes that intuition is more important than rigour. The definition of probability in terms of long-term frequency is fatally flawed, and hence the author makes it very clear in the preface that "probability works only if we do not define probability in the way we talk about probability in the real world"; colloquial references to probability give rise to shaky foundations. However, if you build up probability theory axiomatically, the whole subject is as rigorous as group theory. Statistics in the modern era is vastly different from that of yesteryear. Computers have revolutionized the application of statistics to real-life problems, and many modern problems are solved by applying Bayes' formula via MCMC packages. If this statement is surprising to you, then you should definitely read this book. Gone are the days when statisticians referred to tables of distributions and p-values to talk about their analysis. In fact, in the entire book of 500-odd pages, there are only about 15 pages on hypothesis testing, and that too with the title "Hypothesis testing, if you must". Today one of the critical ingredients of a statistician's toolbox is MCMC. Ok, let me attempt to summarize this book.

Chapter 1: Introduction

The author states that the purpose of the book is only to provide sufficient of a link between Probability and Statistics to enable the reader to go on to advanced books on Statistics, and to show also that "Probability in its own right is both fascinating and very important". The first chapter starts off by stating the two most common errors made in studying the subject: the first, when it comes to actually doing calculations, is to assume independence; the other is to assume that things are equally likely when they are not. The chapter comprises a set of examples and exercises that show the paradoxical nature of probability and the mistakes we can make if we are not careful. The Two Envelopes Paradox illustrates the point that a measure-theoretic understanding is critical to a proper grasp of the subject. The idea of probability as a long-term relative frequency is debunked with a few examples that motivate the need for measure theory.

This chapter is a charm as it encourages readers to embrace measure theory results in a manner that helps intuition. It says that by taking a few results from measure theory for granted, we can clearly understand the subject and its theorems. The author says that "Experience shows that mathematical models based on our axiomatic system can be very useful models for everyday things in the real world. Building a model for a particular real-world phenomenon requires careful consideration of what independence properties we should insist that our model have".

One of the highlights of the book is that it explains the linkage between probability and statistics in an extremely clear way. I have not come across such a lucid explanation of this linkage anywhere before. Let me paraphrase the author here,

In Probability, we do in effect consider an experiment before it is performed. Numbers to be observed or calculated from observations are at that stage Random Variables – what I shall call Pre-Statistics, rather nebulous mathematical things. We deduce the probability of various outcomes of the experiment in terms of certain basic parameters.

In Statistics, we have to infer things about the values of the parameters from the observed outcomes of an experiment already performed. The Pre Statistics have now been crystallized in to actual statistics, the observed number or numbers calculated from them.

We can decide whether or not operations on actual statistics are sensible only by considering probabilities associated with the Pre-Statistics from which they crystallize. This is the fundamental connection between Probability and Statistics.

The chapter ends with the author reiterating the importance of computers in solving statistical problems. Analytical closed forms are not always needed; sometimes they are not even possible, so one has to resort to simulation techniques. Maybe 20 years back it was ok to somehow do statistics without much exposure to statistical software, but not any more. Computers have played a fundamental role in the spread and understanding of statistics. Look at the rise of R and you come to the inevitable conclusion that "statistical software skills are becoming as important as modeling skills".

Some of the quotes by the author that I found interesting are

  • The best way to learn how to program is to read programs, not books on programming languages.
  • In regard to Statistics, common sense and scientific insight remain, as they have always been, more important than Mathematics. But this must never be made an excuse for not using the right Mathematics when it is available.
  • Intuition is more important than rigor, though we do need to know how to back up intuition with rigor.
  • Our intuition finds it very hard to cope with sometimes perverse behavior of ratios
  • The likelihood geometry of the normal distribution, the most elegant part of Statistics, has been used to great effect, and remains an essential part of Statistical culture. What was a good guide in the past does not lose that property because of advances in computing. But, modern computational techniques mean that we need no longer make the assumption of normally distributed errors to get answers. The new flexibility is exhilarating, but needs to be used wisely


Chapter 2: Events and Probabilities

The second chapter gives a whirlwind tour of measure theory. After stating some basic axioms of probability, like the addition rule and the inclusion-exclusion principle, the chapter goes on to define the important terms and theorems of measure theory. This chapter is not a replacement for a thorough understanding of measure theory; however, if one does not want to plough through measure theory AT ALL, but wants just enough of an idea to get going, then this chapter is useful. For readers who already know measure theory, it serves as a good recap of the parts needed to understand statistics. The concepts introduced and stated in this chapter are monotone convergence properties, almost surely, null sets, null events, the Borel-Cantelli lemma, Borel sets and functions, and the pi-system lemma. A good knowledge of these terms is priceless for understanding stats properly. The Banach-Tarski paradox is also stated, so that one gets a good idea of the reason for leaving out non-measurable sets when constructing the event space.

Chapter 3: Random Variables, Means and Variances

The tone of the book is sometimes funny. The author at the beginning of the chapter says

This chapter will teach you nothing about probability and statistics. All the methods described for calculating means and variances are dull and boring, and there are better, more efficient indirect methods described later in the book.

Random variables are basically measurable functions, and the reason for choosing Lebesgue / Borel measurable functions is that it is hard to escape from the measurable world after you perform additions, subtractions, multiplications, limits etc. on these functions. The author introduces a specific terminology in the book that helps the reader understand the linkage between Statistics and Probability. A Pre-Statistic is a special kind of Random Variable: a Random Variable Y is a Pre-Statistic if the value Y(w_actual) will be known to the observer after the experiment is performed; Y(w_actual) then becomes the observed value of Y. An example mentioned in the book brings out the difference:

Let’s say you want to know the width of an object and you have 100 different observations of it. You can calculate the mean of the observations and the variance of the residuals. But to say anything more, one needs to study the Pre-Statistics M and RSS and know their distributions so that one can compute confidence intervals. So the Pre-Statistics become important and are a part of Probability; the inference about the parameters forms a part of Statistics.

The chapter then talks about distribution functions, probability mass functions and probability density functions. Expectation of random variables is introduced via simple random variables and non-negative random variables. I really liked the discussion of the supremum that connects the expectation of a general non-negative random variable with that of simple non-negative random variables; going through it, I am able to verbalize the ideas better. Wherever measure theory is used, it is flagged with the letter M so that a disinterested reader can skip the content. However, it is very important for any student of statistics to know the connection between integration and expectation: the Lebesgue integral of a non-negative measurable function is defined as the supremum of the integrals of simple functions below it. This is where the connection to integration appears.

The chapter quickly introduces the L1 and L2 spaces and shows how various random variables fit in. The mean of a random variable can be computed if the random variable lies in L1, whereas the variance makes sense for a variable in L2. The fact that, on a probability space, L2 is a subset of L1 (and so on down the chain) is a deep result, one that becomes clear after some exposure to functional analysis, normed vector spaces, metric spaces, etc. So my feeling is that a reader who has never been exposed to spaces such as Hilbert spaces would fail to appreciate various aspects of this chapter; there is no alternative but to slog through the math from some other book. Still, the chapter serves as a wonderful recap of function spaces. The fact that a standardized random variable lives in L2 needs to be understood one way or the other, be it through slog or through the right kind of intuition.

Chapter 4: Conditioning and Independence

The chapter starts with these words

Hallelujah! At last, things can liven up.

The author is clearly excited about the subject, and more so about conditioning. The author's enthusiasm might be infectious too; at this point in the book, any reader would start looking forward to the wonderful world of conditioning.

The chapter introduces the basic definition of conditional probability, states Bayes' theorem, and gives the famous Polya's urn example to illustrate the use of conditioning. Polya's urn is described as "a thing of beauty and a joy forever", meaning the model is mathematically very rich; indeed, there are books on probability models where every model has the Polya urn scheme as its spine. Look at the book Polya Urn Models, where innumerable models are discussed with the Polya urn in the background. Simple examples of Bayesian estimation are followed by a rather involved example of Bayesian change-point detection.

I think the section on genetics and the application of conditional probability there has been the biggest takeaway from this chapter and probably from the entire book. I am kind of ashamed to write here that I had not read any books by Richard Dawkins, one of the greatest living scientific writers. Sometimes I am amazed at my ignorance of various things in life, and then I am reminded of the innumerable activities I wasted my time on in the past. Well, at least I am glad that I now recognize the value of time better. Anyway, the author strongly advocates that the reader go over at least a few books by Dr Richard Dawkins. Thanks to torrent sites, all his books are available for free; I have downloaded a couple and put them on my reading list, and at the least I should read "The Selfish Gene" and "The Blind Watchmaker" soon. Apart from the fantastic theory, there is tremendous scope for applying conditional probability to the field of genetics, and I hope to understand these aspects as I keep honing my probability skills. Coming back to the summary, the chapter then introduces the concept of independence and the Borel-Cantelli lemma, and shows through an example the connection between lim sup and the Borel-Cantelli lemma. Some of the exercises in the chapter are challenging. The "Five Nations problem" is one I found interesting. It goes like this: if 5 nations play a sport against each other, i.e. a total of 10 games, what is the probability that each nation wins exactly 2 games? The answer is 3/128, and it is not immediately obvious; one needs to think a bit to get to the solution. I worked it out using a simulation first and then solved it analytically.
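
For what it is worth, here is roughly what such a simulation looks like in R (my own sketch; each game is treated as a fair coin flip between the two nations):

  # Five nations, ten games: estimate P(every nation wins exactly 2 games)
  set.seed(14)
  pairs <- t(combn(5, 2))                      # the 10 match-ups
  one_tournament <- function() {
    wins <- integer(5)
    for (k in seq_len(nrow(pairs))) {
      winner <- ifelse(runif(1) < 0.5, pairs[k, 1], pairs[k, 2])
      wins[winner] <- wins[winner] + 1
    }
    all(wins == 2)
  }
  mean(replicate(1e5, one_tournament()))       # hovers around 3/128 ~ 0.0234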

I knew how to calculate the pdf of an order statistic using a rather tedious approach: crank out the cdf of the IID variables, differentiate, and show that the order statistic follows a beta distribution. Williams derives the same result intuitively but rigorously, without resorting to calculus. I had persevered with the calculus approach earlier, so this intuitive approach was a pleasant surprise that I will always remember. The order statistic for the median of a uniform random sample is worked out and immediately followed by an application; this is the recurring pattern of the book, where every statistic that is derived (in this case the median of n iid uniform rvs) is followed by a use of it. As one is aware, the Cauchy distribution has no mean, so merely averaging N Cauchy variables will not help you estimate the location parameter; it is the median that comes to the rescue, since the sample median is likely to be close to the location parameter of the Cauchy distribution.
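
The Cauchy point is easy to see numerically; a small sketch:

  # Sample mean vs sample median of Cauchy(location = 0) draws as n grows
  set.seed(15)
  for (n in c(100, 10000, 1000000)) {
    x <- rcauchy(n, location = 0)
    cat(n, " mean:", round(mean(x), 3), " median:", round(median(x), 3), "\n")
  }
  # the median hugs 0 tighter and tighter, while the mean keeps wandering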

The section on the Law of Large Numbers has an interesting example, the car convoy problem, which shows that a simple probability model can have sophisticated mathematical underpinnings. This example is a great application of the Borel-Cantelli lemma and one I hope to remember. Basic inequalities like Tchebyshev's and Markov's are introduced, the former being used in the derivation of the weak law of large numbers.

Every experiment, no matter how complicated, can be reduced to choosing a number from the uniform distribution on (0,1); this fact is intuitively shown using Borel's strong law for coin tossing. Kolmogorov's Strong Law of Large Numbers is stated and proved in an intuitive manner with less frightening measure-theoretic math. In the end it is amply clear that the SLLN deals with almost sure convergence and the WLLN with convergence in probability, and this is where things tie in. In an infinite coin-tossing experiment it is the SLLN that is invoked to show that the proportion of heads converges to ½ almost surely; the sequences that do not converge to ½ together form a set of measure 0. Convergence in probability is weaker in the sense that the running average need not settle down to the mean along a particular realization, even though the probability of it being close to the mean tends to 1. The author then states the "Fundamental Theorem of Statistics" (Glivenko-Cantelli), which concerns the convergence of the sample distribution function to the population distribution function. This is basically the reason why bootstrapping works: you compute the ecdf and work with it for all sorts of estimation and inference. The section also mentions a few examples which show that the weak law is indeed weak, meaning the long-run average can differ significantly from the mean even while it converges in probability to the mean.

The chapter then introduces simple random walks. This section is akin to compressing Chapter 3 of Feller Volume 1, stripping out the math and explaining the intuition behind the reflection principle, hitting and return probabilities, etc. There is a section titled "Sharpening our intuition" which I found highly illuminating: some of the formulas derived in Feller show up here, with the author showing the calculations behind hitting-time and return-time probabilities. The reflection principle and its application to the Ballot theorem are also mentioned. As I go over these examples, I tend to remember statements made by Taleb, Rebonato and others who warn that the world of finance cannot be modeled using coin-tossing models. At the same time, if these examples improve my intuition about randomness, I do not care whether the principles are directly useful in finance or not; certainly I would love to apply them in finance, but if my intuition improves in the process of working through them, that is good enough. Anyway, I am digressing from the intent of the post, i.e. to summarize.

Some of the problems involving random walks are solved by forming difference equations. These difference equations inevitably rely on the property that the process starts afresh after a certain event. For example, in the gambler's ruin problem (you start with a capital a and you win if you reach b before hitting 0), after your first win you face the same problem as before, except that your capital is now a+1. In such examples, where intuition yields correct results, one must also understand the details behind this "process starting afresh". This topic is explored in a section titled "Strong Markov Principle". To understand the Strong Markov Principle, one needs to understand optional stopping times. The highlight of this book, as mentioned earlier, is that things are explained intuitively with just the right amount of rigour. Optional stopping times are introduced in texts such as Shreve or other math-finance primers using heavy machinery; a book like this should be read first by any student so that he/she gets the intuition. Williams frames the problem with a simple question: "For what kind of random time will the process start afresh?" A random time T is a stopping time if the event {T <= n} can be determined from the variables X1, X2, …, Xn alone; generally, a stopping time is the first time that something observable happens. With this definition, one can ask what sorts of random times qualify as stopping times for a given process, and one must carefully understand what it means for T to be a stopping time for a process X. The book subsequently states the Strong Markov Principle, which can then be used to justify that the process starts afresh.
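
A small simulation (my own sketch) of the gambler's ruin setup described above, checking the textbook answer that with a fair coin the probability of reaching b before 0, starting from a, is a/b:

```
set.seed(3)
reach_b_first <- function(a, b) {
  x <- a
  while (x > 0 && x < b) {
    x <- x + sample(c(-1, 1), 1)    # fair coin: win or lose one unit
  }
  x == b                             # TRUE if the gambler reached b before ruin
}
a <- 3; b <- 10
mean(replicate(5000, reach_b_first(a, b)))   # should be close to a/b = 0.3
```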

The chapter ends with a note describing various methods to generate IID samples. Various pseudo-random generators are explored, such as multiplicative and mixed congruential generators. Frankly I am not really motivated to understand the math behind these generators, as most of them are already coded in statistical packages; whether they use some complicated number-theoretic property or some trick is not something I want to spend time on. So I quickly moved to the end of this section, where acceptance-rejection sampling is described. Inverting a cdf to get a random variate from a distribution is not always possible, as the inverse of the cdf might not have a closed form. It is precisely for such problems that the accept-reject procedure is tremendously useful. A reader coming across the method for the first time typically wonders why it works, and ideally one should get the intuition right before heading into the formal derivation. The intuition, as I understand it from the book, is as follows:

  • It is hard to simulate samples directly from a complicated pdf, let's say f(x)
  • Choose an analytically tractable pdf g(x) such that a scaled version of g dominates f on the same support
  • Use a criterion to accept or reject each sample drawn from g(x). The logic is: if the ratio f(z) / (scaled g(z)) is, say, 0.6 for z1 and 0.1 for z2, then z1 should be kept more often than z2. This is made concrete by drawing a uniform random number and comparing it with f(z) / (scaled g(z)), accepting or rejecting the simulated value accordingly.

The key is to understand the acceptance criterion, which is stated as an inequality. Obviously there are cases where accept-reject methods fail; in such situations Gibbs sampling or other MCMC methods are used. There is a lot of non-linearity in going about understanding statistics: say you get excited about MCMC and want to simulate chains, you have to learn a completely new piece of software and its syntax (WinBUGS), and if you want to use R to further analyze the output, you have to use the BRugs package. I just hope these non-linear paths help the overall understanding and don't end up being unnecessary detours.
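
Here is a minimal accept-reject sketch in R (my own example, not the book's): sampling from a Beta(2, 2) target using a Uniform(0, 1) proposal, with scaling constant M = 1.5 since the Beta(2, 2) density never exceeds 1.5.

```
set.seed(99)
target <- function(x) dbeta(x, 2, 2)   # f(x), the "complicated" pdf
M <- 1.5                               # f(x) <= M * g(x), with g(x) = 1 on (0,1)

accept_reject <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    z <- runif(n)                      # candidates from the proposal g
    u <- runif(n)
    keep <- u <= target(z) / M         # accept z with probability f(z) / (M * g(z))
    out <- c(out, z[keep])
  }
  out[1:n]
}

x <- accept_reject(10000)
c(mean = mean(x), var = var(x))        # should be close to 0.5 and 0.05 for Beta(2, 2)
```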

Chapter 5: Generating Functions and Central Limit Theorem

Sums of random variables come up in umpteen situations, so it is imperative to know the distribution or pmf of such sums. The chapter talks about the probability generating function (PGF), the moment generating function (MGF), the characteristic function (CF), the Laplace transform (LT) and the cumulant generating function (C).

The probability generating function is akin to a signature of a random variable: one function summarizes the probabilities of all its realizations. The key theorem that follows is that if the generating functions of a sequence of random variables converge to the generating function of a limit random variable, then the corresponding probabilities converge as well.

A pattern is followed in this chapter. For each of the 5 functions (probability generating function PGF, moment generating function MGF, characteristic function CF, Laplace transformation LT, Cumulant function), the following are proved

  • Convergence of the PGF/MGF/CF/LT/C of a sequence of variables X_n to the PGF/MGF/CF/LT/C of X means the distribution functions of the X_n converge to the distribution function of X
  • Independence of RVs means the PGF/MGF/CF/LT/C of their sum is the product of the individual PGF/MGF/CF/LT/C
  • Uniqueness: if the PGF/MGF/CF/LT/C of two variables match, then the distributions of the two variables match

The CLT, the workhorse of statistics, is stated and proved using the cumulant generating function approach. An integer-correction (continuity-correction) form of the CLT is stated for discrete random variables.
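
A throwaway R check of the CLT (mine, not the book's): standardized sums of skewed exponential variables look increasingly normal.

```
set.seed(5)
standardized_sum <- function(n) {
  x <- rexp(n, rate = 1)                 # mean 1, variance 1
  (sum(x) - n) / sqrt(n)                 # standardize the sum
}
z <- replicate(10000, standardized_sum(50))
c(mean = mean(z), sd = sd(z))            # close to 0 and 1
quantile(z, c(0.025, 0.5, 0.975))        # compare with qnorm: -1.96, 0, 1.96
```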

Chapter 6: Confidence Intervals for one parameter models

The chapter starts off with a basic introduction to the frequentist view of point estimation. Using Pre-Statistics one designs the confidence interval procedure, and after the experiment is performed the actual confidence interval is computed. The point to note is that the confidence statement is about the interval, not about the parameter itself. In Williams' words:

A frequentist will say that if he were to repeat the experiment lots of times, each performance being independent of the others and announce after each performance that, for that performance, parameter belongs to a specific confidence interval, then he would be right about C% of the time and would use THAT as his interpretation of the statement that on the one occasion that the experiment is performed with result yobs, the interval as a function of yobs is a C% Confidence Interval.

The chapter then introduces the concept of the likelihood function. Its immediate relevance is shown through the introduction of sufficient statistics. Sufficient statistics are important in the sense that a sufficient statistic is a lower-dimensional summary of the random sample. How do you check whether a statistic is sufficient? One can use the factorization criterion: the likelihood function should split into two factors, one that does not involve the parameter and another that depends on the data only through the statistic. If one can accomplish this split, the statistic is sufficient. An extremely good way to highlight the importance of a sufficient statistic is given in the book, and it is worth reiterating:

Let t be a sufficient statistic and let T be the Pre-Statistic which crystallizes to t. Let fT be the conditional density of T given theta when the true value of the parameter is theta. Then Zeus (father of the gods) and Tyche can produce IID RVs Y1, Y2, …, Yn, each with pdf f(.|theta), as follows:

  • Stage 1: Zeus picks T according to the pdf fT. He reports the chosen value of T, but NOT the value of theta, to Tyche
  • Stage 2: Having been told the value of T, but not knowing the value of theta, Tyche performs an experiment which produces the required Y1,Y2,…Yn

Thus the sufficient statistic tells us everything the sample has to say about the parameter.
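
For the record, the factorization criterion mentioned above can be written compactly (standard textbook form, stated here for reference rather than quoted from the book): T is sufficient for theta precisely when the likelihood factors as

$$ L(\theta; y_1,\dots,y_n) \;=\; g\bigl(T(y_1,\dots,y_n),\,\theta\bigr)\, h(y_1,\dots,y_n). $$

For example, for IID Bernoulli(theta) observations the likelihood is \(\theta^{\sum y_i}(1-\theta)^{\,n-\sum y_i}\), so \(T=\sum y_i\) is sufficient (with h identically 1).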

An estimator is basically a function of the realized sample; that is what I vaguely remembered having learnt. However the section makes it very clear that an estimator is a Pre-Statistic and hence a function of n random variables, so one can study its pdf/pmf and everything related to it using probability. Once the experiment is performed, the estimator yields an estimate. One of the basic criteria for estimators is unbiasedness, meaning the expectation of the estimator, given a parameter value theta, equals theta. However there can be many unbiased estimators satisfying this definition. How does one choose among them? This is where the concept of efficiency comes in. Any unbiased estimator has a minimum variance bound, given by the inverse of the Fisher information. If you have a bunch of unbiased estimators and one of them attains this bound, you can stop your search and take it as the MVB unbiased estimator. There is a beautiful connection between Fisher information and the variance of an estimator, made evident in the Cramér-Rao minimum variance bound: the inverse of the Fisher information is the smallest variance any unbiased estimator can achieve. This section explains the difference between an unbiased estimator and an MVB unbiased estimator for a parameter.
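
In symbols (the standard form of the bound, stated for reference rather than quoted from the book), for an unbiased estimator \(\hat{\theta}\) based on n IID observations:

$$ \operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{n\,I(\theta)}, \qquad I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]. $$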

The MLE is introduced not for its own sake but to throw light on confidence intervals. Firstly, the MLE is an estimator and, like any estimator, a Pre-Statistic. The way the MLE connects to confidence intervals is through the fact that, for large samples, it is approximately MVB unbiased and approximately normal. This lets one form confidence intervals for the parameter by pretending it has a normal density whose variance is given by the inverse of the Fisher information. The section then moves on to prove the consistency of the MLE, whereby as the sample size increases the MLE converges to the true value of the parameter. To appreciate the connection between the approximate normality of the MLE and the construction of confidence bands, the Cauchy distribution is used. The Cauchy distribution has no mean or variance, so the usual sample-average methods are useless for large n, yet an MLE-based confidence interval can still be formed for a crazy distribution like Cauchy.
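
A rough R sketch of that idea (my own, with the scale fixed at 1): maximize the Cauchy log-likelihood numerically and use the observed information from the optimizer's Hessian to form an approximate 95% interval for the location parameter.

```
set.seed(8)
y <- rcauchy(200, location = 2, scale = 1)

negloglik <- function(theta) -sum(dcauchy(y, location = theta, scale = 1, log = TRUE))

fit <- optim(median(y), negloglik, method = "BFGS", hessian = TRUE)
theta_hat <- fit$par
se_hat    <- sqrt(1 / fit$hessian[1, 1])   # observed information -> standard error
c(estimate = theta_hat,
  lower = theta_hat - 1.96 * se_hat,
  upper = theta_hat + 1.96 * se_hat)
```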

To prove the consistency of the MLE, the concepts of entropy and relative entropy are introduced. Relative entropy measures how badly a density f is approximated by another density g; this is also called the Kullback-Leibler relative entropy. The connection between entropy and the normal distribution is shown, and the special role of the normal distribution in the Central Limit Theorem is highlighted.
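
For reference (the standard definition, not a quotation from the book), the Kullback-Leibler relative entropy of g from f is

$$ D(f\,\|\,g) \;=\; \int f(x)\,\log\frac{f(x)}{g(x)}\,dx \;\ge\; 0, $$

with equality only when f = g almost everywhere.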

The chapter then goes on to cover Bayesian confidence intervals. The basic formula of Bayesian statistics is that the posterior is proportional to the prior times the likelihood. How does one choose priors? What are the different kinds of priors?

  • Improper priors : These priors cannot be normalized to a proper pdf
  • Vague Priors
  • Reference priors : Priors chosen based on the Fisher information
  • Conjugate priors : The prior is chosen so that the posterior (prior times likelihood) stays in the same family as the prior, only with updated parameters (see the small sketch after this list)
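
A minimal sketch of conjugacy (my own example, not the book's): with a Beta(a, b) prior on a binomial success probability, observing s successes in n trials simply updates the prior to Beta(a + s, b + n - s).

```
a <- 2; b <- 2          # Beta prior parameters (assumed)
n <- 50; s <- 36        # observed data: 36 successes in 50 trials

a_post <- a + s         # conjugate update: the posterior is again a Beta
b_post <- b + n - s

# 95% Bayesian credible interval for the success probability
qbeta(c(0.025, 0.975), a_post, b_post)
```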

If one were to make any sensible inferences about parameters, then

  • Either one must have reasonably good prior information, which one builds in to one's prior
  • or, if one has only vague prior information, one must have an experiment with a sample size sufficient to make one's conclusions robust to changes in one's prior

Whenever one chooses a prior there is always a possibility that the prior and the likelihood conflict, in which case the prior times the likelihood is small everywhere and the posterior density becomes something close to a 0/0 expression. This is the main reason for lack of robustness. The author gives an overview of hierarchical modeling. He also mentions the famous Cromwell's dictum and considers an application where a mixture of conjugate priors is taken and imaginary results are used to select a decent prior. Cromwell's dictum basically says that one's prior density should never be literally zero on a theoretically possible range of parameter values; one must regard it as a warning against choosing a prior density that is too localized and therefore extremely small on a range of parameter values which cannot be excluded. Overall, the section on Bayesian confidence intervals compresses the main results of Bayesian statistics into 20-odd pages and distills almost every main feature of Bayesian stats.

In a book of roughly 500 pages, hypothesis testing takes up about 20 pages, and that too under the title "Hypothesis testing – if you must". This is mainly to remind readers that, in the author's view, hypothesis testing is a tool of limited relevance and the better way to report estimates is through confidence intervals.

CIs usually provide a much better way of analyzing data

With this strong advice against hypothesis testing, the author nevertheless covers all its main aspects. The story is straightforward. The most important test in the frequentist world is the likelihood ratio (LR) test. In the frequentist world, the null and alternative hypotheses are not placed on equal footing: the null is always "innocent unless proven guilty". The idea behind the LR test is this: divide the parameter space in two as per the hypotheses, compute the maximum likelihood of the data over the alternative parameter space and over the null parameter space, and take the ratio of the two. Reject the null if the ratio is high. A sensible approach, but this strategy can reject the null when the null is in fact true. So new characters enter the picture. First is the power function, which gives the probability of landing in the rejection region for each parameter value. Ideally an experimenter would want the power function to be 0 on the null and 1 on the alternative parameter space. Since this is too much to ask for, he fixes a value close to 0 for the power function on the null parameter space and a value close to 1 on the alternative parameter space; selecting these values is how Type I and Type II errors are controlled. Having done so, the experimenter computes the sample size of the experiment and the critical value of the LR test. There is a nice connection to the chi-square distribution that helps us get through all this quickly: the deviance is defined as twice the logarithm of the likelihood ratio, and for large samples it is approximately chi-square distributed, so one can use the deviance directly to accept or reject the null. Having given all these details, the author reiterates the point that it is better to report confidence intervals or HDIs than to do hypothesis testing. In that sense Bayesian stats is definitely appealing, but the problem with Bayes is priors: where do we get them? The chapter ends with a brief discussion of model selection criteria.
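
In symbols (the standard large-sample result, stated from memory rather than quoted from the book):

$$ D \;=\; 2\log\Lambda \;=\; 2\bigl(\ell(\hat{\theta}) - \ell(\hat{\theta}_0)\bigr) \;\xrightarrow{\;d\;}\; \chi^2_k, $$

where k is the number of parameters pinned down by the null hypothesis.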

Chapter 7: Conditional Pdfs and Multi-Parameter Bayes

The chapter starts off by defining joint pdfs, conditional pdfs, joint pmfs and conditional pmfs. It then shows a geometric derivation of the Jacobian. Transformation of random variables is introduced so that one can calculate the joint density of the transformed variables. One point often missed in intro-level probability books is that when we talk about a pdf it is "a pdf" and not "the pdf", because one can define many pdfs which agree everywhere except on a set of measure 0 (a subtle point; a novice might have to go through measure theory to appreciate it). The section then derives the F and t distributions by transforming appropriate random variables. Simulation of values from the gamma and t distributions is discussed in detail. The following points summarize the interconnections:

image

Frankly I speed-read this part because, well, all it takes to generate a gamma rv is R's rgamma function, and a t rv is rt. When I first came across pseudo-random numbers I was eager to know all the gory details, but as time has passed I have developed an aversion to going too deep into random number generators. As long as there is a standard library that does the job, I don't care how it's done. The chapter then introduces conditional and marginal pdfs to prepare the ground for multi-parameter Bayesian statistics.

If one needs to estimate mu and sigma for the data, one can form priors for each of these parameters, invoke the likelihood function, and write down the posterior pdf. However the resulting expression is rather complicated. Conditional on mu the expression is tractable, and conditional on sigma it is also tractable.

The idea of the Gibbs sampler , as for all MCMCs is that being given a sample with empirical distribution close to a desired distribution is nearly as good as being given the distribution itself

The last section of this chapter is about multi-parameter Bayesian statistics. The content is extremely concise and I found some of it challenging, especially the part on the Bayesian-Frequentist conflict. In a typical posterior density computation, the biggest challenge is the normalizing constant that has to be tagged on to prior times likelihood; this constant of proportionality is very hard to evaluate, even numerically. This is where Gibbs sampling comes in: it lets one get at the posterior density by simulation. In a posterior pdf that usually looks very complicated, if you condition on all the parameters except one you get a tractable conditional density. So the logic behind Gibbs sampling is: condition on all the parameters except the first one and draw a new value for the first parameter from its full conditional; then condition on all the parameters except the second, using the most recent values of the others, and draw a new second parameter; keep cycling until the chain has converged to the posterior. A nice feature of Gibbs sampling is that you can simply ignore nuisance parameters in the MCMC output and focus on whatever parameters you are interested in. The author then uses WinBUGS to infer the mean and variance of a sample from a normal distribution; full working code is given in the text for the reader to type into WinBUGS. Of course these examples merely scratch the surface; there are umpteen things to learn to model with WinBUGS and entire books that flesh out the details.
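
To make the cycling idea concrete, here is a minimal hand-rolled Gibbs sampler in R (my own sketch, assuming the standard semi-conjugate setup: a normal prior on mu and a gamma prior on the precision tau = 1/sigma^2):

```
set.seed(123)
y <- rnorm(100, mean = 5, sd = 2)
n <- length(y)

# Assumed priors for this sketch: mu ~ N(mu0, 1/prec0), tau ~ Gamma(a0, b0)
mu0 <- 0; prec0 <- 1e-4; a0 <- 0.01; b0 <- 0.01

n_iter <- 5000
mu  <- numeric(n_iter); tau <- numeric(n_iter)
mu[1] <- mean(y); tau[1] <- 1 / var(y)

for (i in 2:n_iter) {
  # Full conditional of mu given tau: normal
  prec_n <- prec0 + n * tau[i - 1]
  mean_n <- (prec0 * mu0 + tau[i - 1] * sum(y)) / prec_n
  mu[i]  <- rnorm(1, mean_n, sqrt(1 / prec_n))

  # Full conditional of tau given mu: gamma
  tau[i] <- rgamma(1, a0 + n / 2, b0 + 0.5 * sum((y - mu[i])^2))
}

burn <- 1000
c(post_mean_mu    = mean(mu[-(1:burn)]),
  post_mean_sigma = mean(1 / sqrt(tau[-(1:burn)])))
```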

The author then quickly mentions the limitations of MCMC:

  • All MCMC techniques are simulations, not exact answers
  • Convergence to the posterior distribution can be very slow
  • The chain may converge to the wrong distribution
  • The chain may get stuck in one region of the parameter space
  • The results might not be robust to changes in the priors

The author also warns that one should not specify vague priors for hyper-parameters, as this can lead to improper posterior densities. There is also a mention of the problem of simultaneous confidence intervals. A confidence region for a vector parameter has two difficulties: firstly, it can be rather expensive in computer time and memory to calculate; secondly, it is usually difficult to convey the results in a comprehensible way. The section ends with a brief discussion of the Bayesian-Frequentist conflict, which I found hard to understand. Maybe I will revisit it at a later point in time.

Chapter 8: Linear Models and ANOVA

The chapter starts off by explaining the orthonormality principle in two dimensions. The Orthonormality Principle marks the perfect tie-up that often exists between likelihood geometry and least-squares methods. Before explaining the math behind the models, the author gives a trailer for all the important ones; the trailers contain visuals that motivate the abstract linear-algebra setting taken in the chapter, and they concern the normal sampling theorem, linear regression and ANOVA. The second section gives a crash course on linear algebra and then states the orthonormality principle for projections. This principle gives us a way to look at the geometry of a linear model. If you are given a basic linear model Y = mu + sigma G, then one way to tie the likelihood geometry to inference is as follows (a small R illustration follows the list):

  • Write down the likelihood function
  • The MLE for mu is the projection on to the subspace U relevant to the hypothesis
  • Compute the LR test statistic
  • Compute the deviance
  • Compute the F statistic
  • If n is large, the deviance is approximately a chi^2 statistic
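
A minimal R illustration of the frequentist side of this recipe (my own toy data): fit a linear model, compare it with the null model, and read off the F statistic.

```
set.seed(21)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)   # toy data from a known linear model

full <- lm(y ~ x)    # alternative: mean depends linearly on x
null <- lm(y ~ 1)    # null: constant mean

anova(null, full)               # F test comparing the two nested models
summary(full)$fstatistic        # same F statistic, straight from the fit
```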

One very appealing feature of this book is that there is a Bayesian analysis for every model, so one can see the WinBUGS code written by the author for linear regression and ANOVA. It gives the reader a good idea of the flexibility of Bayesian modeling; even though the mathematics is not as elegant as in the frequentist world, the flexibility of Bayes more than compensates for it. The chapter explains the main techniques for testing goodness of fit, i.e. the qqplot, the KS test (basically a way to quantify the quantile comparison) and the likelihood-based LR test; the LR test can be approximated by the classic Pearson statistic. Base R lets you draw qqplots for Gaussian distributions; for other distributions, one can check the car package, which not only supports different distributions but also draws confidence bands for the observed quantiles. Goodness of fit coupled with parameter estimation is a dicey thing. I had never paid attention to this before; I was always under the impression that parameter estimation could be done first and a goodness-of-fit test run on the estimates afterwards. The author clearly gives the rationale for the problems that might arise by combining parameter estimation and goodness-of-fit testing.
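
A quick example of those goodness-of-fit tools (hypothetical data of mine; the car::qqPlot call assumes the package is installed):

```
set.seed(31)
resid_like <- rnorm(200, mean = 0, sd = 1.5)

qqnorm(resid_like); qqline(resid_like)              # base R normal qqplot
ks.test(resid_like, "pnorm", mean = 0, sd = 1.5)    # KS test against N(0, 1.5^2)

# With the car package, qqplots for other distributions plus confidence bands:
# library(car); qqPlot(resid_like, distribution = "norm")
```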

There is a very interesting example that shows the sticking phenomenon: the MCMC chain sticks to certain states and never converges to the right parameter value. The example deals with data from a standardized t distribution where one has to infer the degrees-of-freedom parameter. Gibbs sampling and WinBUGS do not recover the right value; in fact the chain never visits the region around the true parameter. This example has made me curious to look up the literature on ways to solve the "sticking problem" in MCMC methods. The chapter ends with a section on the multivariate normal distribution that defines the pdf of the MVN, lists various properties of the MVN, derives the CI for the correlation between bivariate normal RVs and states the multivariate CLT.

Out of all the chapters in the book, I found this one the most challenging and demanding. I had to understand Fisherian statistics, likelihood modeling, WinBUGS and Bayesian hierarchical modeling to follow the math. In any case I am happy that the effort put into the preparation has paid off well; I have a better understanding of frequentist and Bayesian modeling after reading through this chapter.

Chapter 9 : Some Further Probability

This chapter comprises three sections. The first section is on conditional probability. Standard material is covered, such as the properties of conditional expectation and applications to branching processes. What I liked about this section is the thorough discussion of the "Two envelopes paradox"; the takeaway is to be cautious about conditional expectations of random variables that are not in L1.

The second section is on martingales. I think this section can serve as a terrific primer for someone looking to understand martingales deeply, and as a prequel to working through "Probability with Martingales" by the same author. The section starts off by defining a martingale and giving a few examples arising from a random walk. The Constant-Risk Principle is stated as a criterion for checking whether a stopping time is finite or infinite. This is followed by Doob's Stopping Time Principle (STP); the author remarks that this subsection is one of the most interesting in the book, and indeed it is. Using the STP, many problems become computationally convenient. The highlight of this section is treating a "waiting time for a pattern" problem as a martingale. Subsequently an extension of Doob's STP, Doob's Optional Sampling Theorem, is stated. The author also states the Martingale Convergence Theorem, which he considers one of the most important results in mathematics.
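
The "waiting time for a pattern" trick is worth a quick check. The martingale (gambling-team) argument gives an expected waiting time of 10 tosses for the pattern HTH with a fair coin; a simulation of my own agrees:

```
set.seed(17)
wait_for_HTH <- function() {
  tosses <- character(0)
  repeat {
    tosses <- c(tosses, sample(c("H", "T"), 1))
    n <- length(tosses)
    if (n >= 3 && identical(tosses[(n - 2):n], c("H", "T", "H"))) return(n)
  }
}
mean(replicate(20000, wait_for_HTH()))   # should be close to 10
```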

Towards the end of the section, the author proves the Black-Scholes option pricing formula using martingales. The third section is on Poisson processes; I think it is basically meant to provide some intuition about the diverse applications of conditional probability models. The last chapter of the book is on quantum probability and quantum computing, which was too challenging for me to read through.

image Takeaway :

The author starts off the book with the following statement: "Probability and Statistics used to be married; then they separated; then they got divorced; now they hardly ever see each other. The book is a move towards much needed reconciliation." The book does a terrific job of rekindling that lost love. This is, by far, the best book on "probability and statistics" that I have read to date. Awesome work by the humble professor, David Williams.

image

This book reminds me of "Elementary Stochastic Calculus with Finance in View" by Thomas Mikosch in terms of its overall goal: making the reader understand the nuts and bolts of the Black-Scholes pricing formula. Probability theory, Lebesgue integration and Ito calculus are the main ingredients of the Black-Scholes formula, and these rely on set theory, analysis and an axiomatic approach to mathematics. Anything in math is built from the ground up, meaning every idea/proof/lemma/axiom is pieced together logically so that the overall framework makes sense. This book introduces all the necessary ingredients in a pleasant way. There are challenging exercises at the end of every chapter and the reader is advised to work through all of them; the author motivates the reader by saying

An hour or two attempting a problem is never a waste of time and, to make sure this happened, exercises were the focus of our small-group weekly workshops

Chapter 1: Money and Markets

The first chapter gives a basic introduction to the time value of money and serves as a basic refresher to calculus.

Chapter 2: Fair Games

The irrelevance of expectation pricing in finance is wonderfully illustrated in Baxter and Rennie's book on financial calculus. Why is expectation-based pricing dangerous? Because expectation pricing is not enforceable. There is another kind of pricing that is enforceable, and any other pricing technique is dangerous; it goes by the name "arbitrage pricing". This chapter starts off with a basic example where two people, John and Mark, play a betting game with each other; expectation pricing makes sense in that case only if they play a lot of games with each other.

The second example is where John and Mark place bets with a bookmaker on a two-horse race. The bookmaker offers odds on each horse, and these odds can be used to back out the implied probability of each horse winning. If the bookmaker does not quote odds based on arbitrage pricing, he risks going bankrupt; there is no place for expectation-based pricing here. Based on the odds quoted, a particular horse might have two different probabilities with respect to John and Mark; basically, John and Mark are operating in different probability spaces. If there is a single player betting on each horse, the bookmaker can quote odds and be done with it, assuring himself a guaranteed profit. However, as the bets start accumulating, he becomes more and more exposed to the risk of a huge loss. He must either change the odds or hedge the exposure. To remove the uncertainty, the bookmaker can place a bet with another bookmaker on the horse whose win would bankrupt him; in this way he locks in a guaranteed profit, or at least a breakeven.
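
To make the implied-probability idea concrete, here is a tiny R sketch with made-up odds of my own, not the book's example:

```
# Hypothetical decimal odds quoted by a bookmaker on a two-horse race
odds <- c(horse_A = 1.7, horse_B = 2.1)

implied_prob <- 1 / odds    # probability of each horse implied by its odds
implied_prob
sum(implied_prob)           # > 1: the bookmaker's margin; < 1 would hand bettors an arbitrage
```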

If you think from a bookmaker's perspective, all the activities he performs, such as quoting the odds, hedging and changing the odds, are the same activities as those of a seller of derivative contracts. In fact this chapter is a superb introduction to the concept of derivative pricing under an equivalent martingale measure. I had loved this introduction in Baxter and Rennie and was thrilled to see the same kind of intro in this book. In fact, I think any book on derivative pricing should have the slogan "Banish expectation pricing – embrace arbitrage pricing" at the very beginning.

Chapter 3: Set Theory

The chapter begins by motivating the reader: one has to go through theorems, proofs, lemmas and corollaries because they are the organizing principles of any field. By learning these abstract ideas one can apply them to various situations, which is a much better way than learning case-specific results. In the author's words:

The axiomatic approach does contain a degree of uncertainty not experienced when the classical approach of learning by rote and reproducing mathematical responses in carefully controlled situations is followed, but this uncertainty gradually diminishes as progress is achieved. The cause of this uncertainty is easy to identify. Over time one becomes confident that understanding has been acquired, but then a new example or result or failure to solve a problem or even to understand a solution may destroy this confidence and everything becomes confused. A re-examination is required, and with clarification, a better understanding is achieved and confidence returns. This cycle of effort, understanding, confidence and confusion keeps repeating itself, but with each cycle progress is achieved. To initiate this new way of thinking usually requires a change of attitude towards mathematics and this takes time.

Having given this motivation, the author talks about the mysterious "infinity" that is at the heart of mathematical abstraction. The reader is taken through countability, least upper bounds and greatest lower bounds, etc. After this selective journey through the real numbers, the chapter turns to sigma algebras. Maybe some other book I have read carried a visual of a filtration; I don't recollect one now. In any case, the best thing about this chapter is the visual it provides for a discrete filtration.

In a typical undergrad probability course one works with situations where the entire outcome space is visible right away, so one does not need the concept of sigma algebras. In finance, though, there is a time element to a random stock price or any random quantity: you do not know the entire outcome space, and information gets added as you move from one day to the next. So one needs to be comfortable with sigma algebras that are subsets of the master sigma algebra (a term I am using just to make things easier to write); not all subsets of the outcome space qualify as events at a given time. Definitions of sigma algebra and measurable space are given. Subsequently, the concept of a "generating set for the sigma field" is explored; generating sets are usually small. A nice example asks the reader to describe the sigma field generated by a collection of subsets. Soon enough you realize the exercise becomes tedious if you try to include unions and intersections manually; the mess is too painful to wade through. An alternative solution using partitions is presented that makes the exercise workable: the key idea is to relate partitions to equivalence classes and then use these equivalence classes to generate the sigma algebra quickly. In Shreve I came across the phrase "sets are resolved by the sigma algebra"; in this chapter the same idea is aptly summarized by many visuals. Getting a good feel for filtrations in discrete time will mightily help in transitioning to continuous time.

Chapter 4 : Random Variables

Random variables are basically functions that map the outcome space to R. Ideally one could stay in (Omega, F) and do all the computations there; however, to take advantage of the rich structure of R, it must be connected with F. This is done via the concept of measurability. In a way one is moving to a different space for computational ease, and to move from one space to another the structure has to be preserved. This structure goes by the name sigma field, and the mappings that preserve it are called measurable functions.

  • For measurable spaces (Omega, F), you have measurable functions to move to R
  • For probability spaces (Omega, F, P), you have random variables to move to R

Whenever one talks about measurable functions, there are some core concepts to grasp. Firstly, the inverse image of any Borel set under the function must lie in F; only then does the definition make sense. An intuitive way of thinking about measurable functions is that they are carriers of information. Once the basic criterion for measurability is satisfied, we give names to things: the collection of inverse images of all the Borel sets in R is called the sigma algebra generated by X, denoted F_X, and typically F_X is a sub-sigma-algebra of F. To check whether a function on a measurable space is measurable, one needs test candidates. The Borel sigma algebra is huge, so testing every set would take ages. A convenient shortcut is to pick a collection of sets A that generates the Borel sigma algebra B; instead of checking all Borel sets, one merely checks whether the inverse images of the sets in A lie in F. The total information available from an experiment is embedded in the sigma field of events F of the sample space Omega.

The book puts it this way :

The real valued function X on Omega may be regarded as a carrier of information in the same way as a satellite relays messages from one source to another. The receiver will hopefully extract information from the Borel set B. An important requirement when transmitting information is accuracy. If X is not measurable, then the inverse mapping is not an observable event and information is lost. For this reason we require X to be measurable. If X is measurable, then the information transferred will be about events in F_X. Complete information will be transferred if F_X = F, and at times this may be desirable. On the other hand, F may be extremely large and we may only be interested in extracting certain information. In such cases, if X secures the required information, we may consider X as a file storing the information about all the events in F_X. In the case of a filtration, we obtain a filing system.

Chapter 5 : Probability Spaces

If you take a measurable space and attach a measure to it, it becomes a measure space; it can be called a probability space if the measure of the entire outcome space is 1 and the measure is countably additive. With an experiment you can associate a probability space. Say you have two variables, each with its own probability space. If you want to combine the two probability spaces into a more general structure, you go about it this way: first define the combined outcome space, then define the product sigma algebra, and then extend the probability measure to events in the product sigma algebra using the probability measures of the individual spaces.

Random variables are introduced in this chapter. These are measurable functions defined on a probability space, mapping the outcome space to the real line. Each random variable generates its own sigma algebra, so there are two probability spaces one can think of in this context: the original space, and the induced space whose outcome space is the real line and whose event space is the Borel sigma algebra. Two random variables defined on the same measurable space can be dependent or independent depending on the probability measure attached to it. The chapter gives conditions necessary for the independence of two random variables: obviously, if the sigma fields generated by the random variables are independent, then the variables are independent; if not, independence is decided by the probability measure attached to the common event space.

Chapter 6 : Expected Values

Expected values, lengths and areas are measurements which share a common mathematical background: if you want to measure something, you divide it into small pieces and approximate each piece by a suitable number. During the final decade of the 19th century, mathematicians in Paris began investigating which sets in Euclidean space R^n were capable of being measured. Emile Borel made the fundamental observation that all figures used in science could be obtained from simple figures such as line segments, squares and cubes by forming countable unions and taking complements. He suggested the term sigma field for a collection of sets large enough to cover most of the sets we come across, and he defined the measure of a set as a limit of the measurements obtained by taking countable unions. He did not fully succeed, as he could not show that the resulting measure of a set was independent of the way it was built up from simple sets.

Henri Lebesgue used Borel's ideas on countability and complements but proceeded in a different way: he defined measurability differently, and in the process introduced the "Lebesgue measure", "Lebesgue measurable sets" and "Lebesgue measurable spaces".

There are four types of mathematical objects described in this chapter: simple random variables, positive bounded random variables, positive measurable random variables and integrable random variables. Firstly, simple random variables are just step functions, for which expectations are easy to compute. When talking about expectation, one can use either E or the integral sign to denote it; survival of the fittest does not seem to have happened in the notation. Using E is convenient for conditional expectation, martingales, etc., while the integral sign is convenient for showing expectations over disjoint subsets. It is worth paying attention to this point of notation.

The second level of sophistication is the positive bounded random variable. Any positive bounded random variable has a canonical representation in terms of simple random variables, and all its properties can be explored using this representation.

The third level of sophistication is the positive random variable. Any positive random variable can be represented as the limit of a sequence of positive bounded variables, and using this fact all its properties are stated and proved.

The fourth level of sophistication is the integrable random variable. Any integrable random variable can be represented in terms of positive random variables (its positive and negative parts).

The pattern that is used across the four mathematical objects is the following :

  • Form two increasing sequences of the given type of variable (simple, positive bounded, positive measurable or integrable random variables)
  • Assume they are pointwise convergent to the same limit
  • Prove that the expectations of both sequences converge to the same value

The reason for introducing these four types of objects is to use the expectation of simpler random variables to define the expectation of more sophisticated ones. The chapter ends by proving two important theorems, the Monotone Convergence Theorem and the Dominated Convergence Theorem, which are used in most of the subsequent proofs.

Chapter 7 : Continuity and Integrability

The following are the highlights of this chapter :

  • There is a connection between integrability and the convergence of series of real numbers: the fact that every absolutely convergent series is convergent can be proved using a random variable defined on a probability triple.
  • The independence of two random variables can be characterized by expected values: for independent variables, the expectation of the product splits into the product of the individual expectations.
  • Measures are defined and introduced.
  • The Riemann integral is defined for continuous functions and a criterion for Riemann integrability is given. Against this backdrop, the Lebesgue integral is introduced as the integral of a random variable with respect to a probability measure.
  • The Law of the Unconscious Statistician is stated.
  • Pointwise convergence implies almost sure convergence, which in turn implies convergence in distribution.
  • Chebyshev's inequality.
  • Convex functions and Jensen's inequality.

Chapter 8 : Conditional Expectation

Instead of diving into the topic of conditional expectation right away, the chapter introduces a two-step binomial tree and shows the realizations of the conditional expectation variable along various sample paths. Notions such as a discrete filtration and being adapted to a filtration are discussed in the context of the two-step tree. If the conditioning sigma algebra is generated by a countable partition, then the conditional expectation can be interpreted pointwise; in more general cases it has to be characterized by a few defining conditions, and there is no explicit formula. In this context the Radon-Nikodym theorem is stated without proof. Certain basic properties of conditional expectation are also stated and proved.
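
A tiny R sketch in the same spirit (my own numbers, just to mimic the book's toy tree): on a two-step binomial tree, the conditional expectation of the terminal price given the first move is itself a random variable taking one value per first-step outcome.

```
# Two-step binomial tree: the stock moves up by factor u or down by d,
# each with probability 1/2 (assumed values for illustration).
S0 <- 100; u <- 1.2; d <- 0.8; p <- 0.5

S2 <- c(uu = S0*u*u, ud = S0*u*d, du = S0*d*u, dd = S0*d*d)

# E[S2 | first move was up] and E[S2 | first move was down]:
cond_exp_up   <- p * S2["uu"] + (1 - p) * S2["ud"]
cond_exp_down <- p * S2["du"] + (1 - p) * S2["dd"]
c(up = unname(cond_exp_up), down = unname(cond_exp_down))
```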

Chapter 9: Martingales

The chapter starts off with a basic definition of a discrete martingale adapted to a filtration on a probability space. Four examples are provided which give the reader enough knowledge to apply the principles in option pricing. The thing I liked about this chapter is the section on martingale convergence: if you have a martingale that is bounded in L1, then it converges almost surely. This property is extremely useful in various problems. The line of attack for producing a distribution function for a random variable is to find a martingale that involves this variable, apply the martingale convergence theorem, find the limiting value of the martingale, and from that expression extract the distribution of the random variable. The chapter also defines continuous martingales and mentions Girsanov's theorem in its elementary form. Chapter 10 covers the Black-Scholes option pricing formula, and the last chapter is a good rigorous introduction to stochastic integration.

imageTakeaway:
This book painstakingly builds up all the relevant concepts in probability from scratch. The only pre-requisite for this text is "persistence", as there are a ton of theorems and lemmas throughout the book. The pleasant thing is that there are good visuals explaining various concepts; one may forget a proof, but a visual sticks for a long time. In that sense this book scores over many others.

image

Solving an SDE analytically is possible only in a few instances (toy SDEs); for the majority of cases one solves it numerically. Having said that, this book can be read by anyone interested in understanding SDEs better. Simulation is a great way to understand many aspects of stochastic processes: you can read through Girsanov's theorem for change of measure, but by visualizing it through a few sample paths you gain a deeper understanding. I have managed to go over only the chapters that deal with simulation, and my summary covers only those, i.e. the first two chapters. I have postponed reading Chapter 3, which goes into inference, and Chapter 4, which comprises a set of advanced topics; maybe I will find time for them in the future. For now, let me mention a few points from the book.

The first chapter is a crash course on stochastic processes and SDEs. It zips through basic probability concepts and then introduces change of measure. To give a practical application of change of measure, the chapter mentions the preferential sampling technique and gives an example to illustrate its utility. All the relevant terms one comes across in the context of stochastic processes are defined: filtrations, measurability with respect to a filtration, quadratic variation, etc. As expected, the chapter provides R code that simulates Brownian motion, geometric Brownian motion, the Brownian bridge, etc.
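
In the same spirit, here is a bare-bones simulation of a geometric Brownian motion path in plain R (my own sketch with made-up parameters; the book's sde package wraps this kind of thing in ready-made functions):

```
set.seed(42)
T_end <- 1; n <- 1000; dt <- T_end / n
mu <- 0.1; sigma <- 0.2; S0 <- 100      # assumed drift, volatility and initial price

dW <- rnorm(n, mean = 0, sd = sqrt(dt))               # Brownian increments
W  <- c(0, cumsum(dW))
t  <- seq(0, T_end, length.out = n + 1)

S <- S0 * exp((mu - 0.5 * sigma^2) * t + sigma * W)   # exact GBM solution along the path
plot(t, S, type = "l", xlab = "time", ylab = "price")
```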

In fact the good thing about this book is that it comes with the CRAN package "sde", which has functions to simulate different stochastic processes. So once you understand how things are coded, you can always use the functions provided in the package instead of coding everything from scratch. The Ito integral is defined via the limit of Ito integrals of simple bounded functions; this approximation procedure is explained very well in Oksendal's book. If one is not inclined to go over the math and wants to see it visually, one can check out the code in this book, where a sequence of Ito integrals of non-anticipating functions converges in mean square to the Ito integral of a general integrand. All said and done, visual understanding is good but not enough; the effort of going through the math is well worth it, as you will know exactly why the whole thing works. The chapter also lists the important properties of the Ito integral.

Diffusion processes are introduced and conditions for the existence and uniqueness of solutions are provided. The conditions are merely stated here, and for the reasoning behind them I will again point to Oksendal's book, which does a fantastic job of explaining the nuts and bolts. The thing I liked about this chapter is the visual illustration of Girsanov's theorem, the theorem that lets you change the drift of a Brownian motion. The code in the book shows the changed likelihood of paths. I tweaked it a bit to illustrate two cases of change of measure: 1) turning a drifting Brownian motion into a driftless one, and 2) turning a driftless Brownian motion into a drifting one. Here is a sample of 30 paths and their respective path probabilities under the change of measure.

imageimage

The darker path probabilities indicate higher weights for those paths. The visuals on the left show that to turn a drifting BM into a driftless BM, one has to down-weight the paths that drift; the visuals on the right show that to turn a driftless BM into a drifting BM, one has to up-weight the paths that tend to drift. This is exactly what the Radon-Nikodym derivative does.

The chapter ends by stating the following list of models/SDEs and computing their moments, conditional densities, conditional expectations and conditional variances (wherever possible):

  • Ornstein-Uhlenbeck or Vasicek process
  • Geometric Brownian motion model
  • Cox-Ingersoll-Ross model
  • Chan-Karolyi-Longstaff-Sanders (CKLS) family of models
  • Hyperbolic processes
  • Nonlinear mean reversion (Aït-Sahalia) model
  • Double-well potential model
  • Jacobi diffusion process
  • Ahn and Gao model
  • Radial Ornstein-Uhlenbeck process
  • Pearson diffusions
  • The stochastic cusp catastrophe model
  • Generalized inverse gaussian diffusions

The second chapter starts with the two most popular techniques for discretizing SDEs: the Euler approximation and the Milstein scheme. The Lamperti transform is introduced to show that an Euler approximation applied to the Lamperti transform of an SDE is equivalent to the Milstein scheme. The Lamperti transform is illustrated by applying it to GBM and the CIR process, which results in simplified SDEs. The workhorse of the chapter, and of the book, is the function sde.sim().

The following can be accomplished using the above function

  • Processes that can be simulated: OU process, GBM , Cox-Ingersoll-Ross process, Vasicek process
  • Simulation method: Euler, KPS, Milstein, Milstein2, conditional density, Exact Algorithm, ozaki, and shoji
  • Law : One can simulate from the conditional law or the stationary law for the above 4 processes
  • Exact Algorithm: A very powerful algorithm that uses a biased Brownian motion and change of measure principles to simulate a solution for the SDE. The algorithm uses hitting time of Poisson process for simulating the SDE.

The chapter also gives a detailed example comparing the performance of the Milstein and Euler approximation methods. Even though simulating from the transition density is possible in only a few cases, the chapter suggests that wherever possible it should be preferred over the other simulation methods. There is a discussion of local linearization methods that relax some assumptions of the Euler/Milstein schemes, i.e. the drift and diffusion coefficients are not assumed constant within each partition interval.
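
For readers who want to see what an Euler discretization looks like without the package machinery, here is a hand-rolled sketch (my own, with made-up parameters) for the Ornstein-Uhlenbeck/Vasicek SDE dX = theta*(mu - X) dt + sigma dW:

```
set.seed(13)
theta <- 2; mu <- 1; sigma <- 0.3       # assumed OU parameters
T_end <- 5; n <- 5000; dt <- T_end / n

x <- numeric(n + 1); x[1] <- 0          # start the process at 0
for (i in 1:n) {
  drift     <- theta * (mu - x[i]) * dt
  diffusion <- sigma * sqrt(dt) * rnorm(1)
  x[i + 1]  <- x[i] + drift + diffusion  # Euler-Maruyama update
}
plot(seq(0, T_end, by = dt), x, type = "l", xlab = "time", ylab = "X_t")
```

For constant-diffusion SDEs like this one the Milstein correction term vanishes, so Euler and Milstein coincide; the difference shows up for state-dependent diffusions such as the CIR process.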

In the "sde" package there are about 45 functions that one can use for SDE simulation and inference. I have tried to categorize them below (based on my limited understanding):

image
