This book gives a macro picture of machine learning. In this post, I will briefly summarize the main points of the book. One can think of this post as a meta-summary, as the book itself is a summary of all the main areas of machine learning.



Machine learning is all around us, embedded in technologies and devices that we use in our daily lives. It is so integrated with our lives that we often do not even pause to appreciate its power. Whether we appreciate it or not, there are companies harnessing the power of ML and profiting from it. So the question arises: do we need to care about ML at all? When a technology becomes so pervasive, you need to understand it at least a bit. You can’t control what you don’t understand. From that perspective at least, having a general overview of the technologies involved matters.

Machine learning is something new under the sun: a technology that builds itself. The artifact in ML is referred to as a learning algorithm. Humans have always been designing artifacts, whether they are hand built or mass produced. But learning algorithms are artifacts that design other artifacts. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer’s needs. But instead of turning stone into masonry or gold into jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.

At its core, Machine learning is about prediction: predicting what we want, the results of our actions, how to achieve our goals, how the world will change.

The author says that he has two goals in writing this book:

  • Provide a conceptual model of the field, i.e. rough knowledge so that you can use it effectively. There are many learning algorithms out there and many more are being invented every year. The book provides an overview of learning algos by categorizing the people who use them. The author calls each category a tribe. Each tribe has its own master algorithm for prediction:
    • Symbolists: They view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic.
      • Master algorithm for this tribe is inverse deduction
    • Connectionists: They reverse engineer the brain and are inspired by neuroscience and physics.
      • Master algorithm for this tribe is backpropagation
    • Evolutionaries: They simulate evolution on the computer and draw on genetics and evolutionary biology.
      • Master algorithm for this tribe is genetic programming
    • Bayesians: They believe that learning is a form of probabilistic inference and have their roots in statistics.
      • Master algorithm for this tribe is Bayesian inference
    • Analogizers: They learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
      • Master algorithm for this tribe is Support vector machines.

In practice, each tribe’s master algorithm is good for some types of problems but not for others. What we really want is a single algorithm combining the key features of all of them: The Master Algorithm.

  • Enable the reader to invent the master algorithm. A layman, approaching the forest from a distance, is in some ways better placed than the specialist, already deeply immersed in the study of particular trees. The author suggests that the reader pay attention to each tribe and get a general overview of what each tribe does and what tools it uses. By viewing each tribe as a piece of a puzzle, it becomes easier to get conceptual clarity about the field as a whole.

The Machine Learning Revolution

An algorithm is a sequence of instructions telling a computer what to do. The simplest algorithm is: flip a switch. The second simplest algorithm is: combine two bits. The idea connecting transistors and reasoning was first understood by Claude Shannon, better known as the father of information theory, in his master’s thesis. Computers are all about logic. Flipping a set of transistors is all that any algorithm does. However, behind this benign activity, some of the most powerful algorithms go about their work by using preexisting algos as building blocks. So, if computing power increases, do all the algos automatically become efficient and all-powerful? Not necessarily. The serpent in the garden goes by the name "complexity monster". There are many heads to this complexity monster.

  • Space complexity : the number of bits of info that an algo needs to store in the computer’s memory
  • Time complexity : how long the algo takes to run
  • Human complexity : when algos become too complex, humans cannot control them and any errors in the algo execution cause panic.

Every algorithm has an input and an output: the data goes into the computer, the algo does the job and gives out the result. Machine learning turns this around: in goes the data and the desired result, and out comes the algorithm that turns one into the other. Learning algorithms, or learners, are algorithms that make other algorithms. With machine learning, computers write their own programs. The author uses a nice low-tech analogy to explain the power of machine learning:
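To make the "data and desired result in, algorithm out" idea concrete, here is a minimal sketch in Python. It is not from the book: the function name `learn_linear_rule` and the temperature data are my own illustrative assumptions, and a least-squares line is about the simplest learner one could pick.

```python
def learn_linear_rule(inputs, outputs):
    """Fit y = slope * x + intercept by least squares and return it as a function."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    # Slope is covariance(x, y) divided by variance(x).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x

    def learned_program(x):
        return slope * x + intercept

    return learned_program


# In goes the data (Celsius) and the desired result (Fahrenheit) ...
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [32.0, 50.0, 68.0, 86.0, 104.0]

# ... and out comes the algorithm that turns one into the other.
to_fahrenheit = learn_linear_rule(celsius, fahrenheit)
print(to_fahrenheit(100.0))  # approximately 212.0
```

Nobody typed the conversion formula in; the learner recovered it from five examples and handed back a program that works on inputs it never saw.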

Humans have known a way to get some of the things they need by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. The promise of machine learning is that the technology can mirror farming. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

This analogy makes two things immediate. First, the more data we have, the more we can learn. Second, ML is a sword that can slay the complexity monster. If one looks at any learning algorithm, it does broadly one of two things: it learns knowledge, in the form of statistical models, or it learns skills, in the form of the procedures that underlie them. The author talks about another analogy, the information ecosystem. He describes databases, crawlers, indexers and so on as herbivores, patiently munching on endless fields of data. Statistical algos and online analytical processing are the predators, and learning algorithms are the superpredators. Predators turn data into information and superpredators turn information into knowledge.

Should one spend years in the computer science discipline to become an ML expert? Not necessarily. CS makes you think deterministically and ML needs probabilistic thinking. The difference in thinking is a large part of why Microsoft has had a lot more trouble catching up with Google than it did with Netscape. A browser is just a standard piece of software, but a search engine requires a different mind-set.

If you look at all the fantastic stuff behind ecommerce applications, it is all about matchmaking: producers of information are being connected to consumers of information. In this context, learning algorithms are the matchmakers. The arsenal of learning algos will serve as a key differentiator between companies. Data is indeed the new oil.

The chapter ends with a discussion of how political campaigning is being revolutionized by ML. With truck loads of data at each voter level, politicians are turning to ML experts to help them with campaign analysis and directed targeting. The author predicts that in the future, machine learning will cause more elections to be close. What he means by that is that learning algorithms will be ultimate retail politicians.


The Master Algorithm

If you look at algorithms such as nearest neighbor, decision trees, naive Bayes etc., they are all domain agnostic. This warrants a question: "Can there be one learner that does everything, across all domains?" If you estimate a parameter the frequentist way and then check it with Bayesian inference, the two estimates almost coincide if there is a LOT of data. Under a deluge of data, the estimates from different models tend to converge, largely irrespective of the model used. The reality, however, is that data is scarce, and that necessitates making assumptions. Depending on the type of problem and the type of assumption, a certain class of learning models does better than the rest. If one speculates the presence of a master algorithm, then the assumptions also need to go in as input. This chapter explores the presence of a master algorithm that can be learnt from input data, output data and the assumptions. The author grandly states the central hypothesis of the book as
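The frequentist-versus-Bayesian point can be checked with a coin. The sketch below is my own illustration, not the book's: it assumes a Beta(5, 5) prior (an arbitrary choice) and pretends that exactly 70% of the flips came up heads, then measures how far apart the two estimates are as the data grows.

```python
def frequentist_estimate(heads, flips):
    """Maximum likelihood estimate of the probability of heads."""
    return heads / flips


def bayesian_estimate(heads, flips, prior_heads=5.0, prior_tails=5.0):
    """Posterior mean under a Beta(prior_heads, prior_tails) prior (an assumed prior)."""
    return (heads + prior_heads) / (flips + prior_heads + prior_tails)


gaps = []
for flips in (10, 1000, 100000):
    heads = 7 * flips // 10  # pretend 70% of the flips came up heads
    gap = abs(frequentist_estimate(heads, flips) - bayesian_estimate(heads, flips))
    gaps.append(gap)
    print(flips, round(gap, 5))
```

With ten flips the prior pulls the Bayesian estimate noticeably toward 0.5; with a hundred thousand flips the assumption barely matters. Scarce data is exactly where the assumptions do the work.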

All knowledge-past, present, and future-can be derived from data by a single, universal learning algorithm.

The author speculates the presence of a master algorithm and says that there are arguments from many fields that speculate the presence of one.

The argument from Neuroscience : The evidence from the brain suggests that it uses the same learning algorithm, throughout, with the areas dedicated to the different senses distinguished only by the different inputs they are connected to. In turn, the associative areas acquire their function by being connected to multiple sensory regions, and the "executive" areas acquire theirs by connecting the associative areas and motor output.

If you look at the cortex in a human brain, you find that the wiring pattern is similar throughout. The cortex is organized into six different layers, with feedback loops, short-range inhibitory connections and long-range excitatory connections. This pattern is common across the brain. There is some variation in the patterns but the structure is pretty much the same. The analogy is that it is the same algo with different parameters and settings. Low-level motor skills are controlled by the cerebellum, which has a clearly different and regular architecture. However, the author points out that if the cerebellum is injured, the cortex takes over its function. So, this again goes to suggest that there is some "master algorithm" controlling the entire brain.

If you look at a set of learning algos in the ML field, you can infer that at some level they are trying to reverse engineer the brain’s function. One of the five ML tribes, Connectionists, believe in this way of modeling the world.

The argument from Evolution : If one looks at the evolution of species since the beginning of life on earth, one can think of natural selection, or whatever it is that nature does, as an algorithm. This algorithm has all the species as inputs and the species at any given point in time as the output. The master algo does the work of eliminating certain species, allowing certain species to mutate, etc. It is a dynamic process where the outputs are again fed as inputs. This line of thought makes one speculate the presence of a master algorithm. In fact one of the five tribes in the ML world, the Evolutionaries, strongly believes in this way of modeling.

The argument from Physics : Most of physics is driven by simple equations that prune away all the noise in the data and focus on the underlying beauty. Physics laws discovered in one domain are seamlessly applied to other domains. If everything we see in nature could be explained by a few simple laws, then it makes sense that a single algorithm can induce all that can be induced. All the Master Algorithm has to do is provide a shortcut to the laws’ consequences, replacing impossibly long mathematical derivations with much shorter ones based on actual observations. Another way to look at scientific disciplines is to think of their various laws and states as outcomes of a dynamic optimization problem. However, physics is unique in its simplicity. Outside physics and engineering, mathematics is often only reasonably effective.

Machine learning is what you get when the unreasonable effectiveness of mathematics meets the unreasonable effectiveness of data.

The argument from Statistics : Bayesians look at the world from a probabilistic and learning mindset. Bayes’ rule is a recipe to turn data into knowledge. In the past, Bayesian methods were confined to simple applications. However, with the rise of computing power, they are now being applied to a wide range of complex modeling situations. Is Bayes the master algorithm? Well, there seem to be many critics of the Bayesian approach to modeling. All said and done, it definitely appears that Bayesian inference will be a part of the "Master Algorithm" in some way.

The argument from computer science : The author mentions the famous unsolved problem in computer science, P vs. NP. The NP-complete problems are a set of problems that are equivalent in their computational hardness. If you can efficiently solve one of them, you can efficiently solve the rest. For decades, many mathematicians, researchers and scientists have been finding clever tricks to almost solve NP-complete problems. But the fundamental question still eludes us: is the class of problems that we can efficiently solve the same as the class of problems whose proposed solutions we can efficiently check? If you read up on this problem, you will realize that all the NP-complete problems can be reduced to a satisfiability problem. If we invent a learner that can learn to solve satisfiability, it has a good claim to being the Master Algorithm.
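The gap between checking and solving is easy to see on a toy satisfiability instance. The sketch below is my own illustration: the clause encoding (positive integers for variables, negative for their negations) and the function names are assumptions, and the brute-force search is the naive approach, not a serious SAT solver.

```python
from itertools import product


def satisfies(clauses, assignment):
    """Checking a proposed solution is cheap: one pass over the clauses."""
    return all(
        any(assignment[abs(literal)] == (literal > 0) for literal in clause)
        for clause in clauses
    )


def brute_force_sat(clauses, num_vars):
    """Solving by exhaustive search tries up to 2**num_vars assignments."""
    for values in product([False, True], repeat=num_vars):
        assignment = {index + 1: value for index, value in enumerate(values)}
        if satisfies(clauses, assignment):
            return assignment
    return None


# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
solution = brute_force_sat(formula, 3)
print(solution)

# x1 and not x1 can never both hold, so no assignment exists.
print(brute_force_sat([[1], [-1]], 1))
```

The checker stays fast however many variables there are, while the search doubles with every variable added. Whether that doubling can always be avoided is what P vs. NP asks.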

NP-completeness aside, the sheer fact that a computer can do a gazillion tasks should make one confident about speculating the presence of a master algorithm that does the job across several problems. The author uses the example of the Turing machine and says that, back then, it was unthinkable to actually see a Turing machine in action. A Turing machine can solve every conceivable problem that can be solved by logical deduction. The fact that we see these machines everywhere means that, despite the odds, we might see a Master Algorithm sometime in the future.

The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.

Interesting analogy : The author is of the opinion that "human intuition" can’t replace data. There have been many instances where human intuition has gone terribly wrong and a guy with lots of data has done better. The author uses Brahe, Kepler and Newton’s work to draw a parallel to machine learning.

Science goes through three phases, which we can call the Brahe, Kepler, and Newton phases. In the Brahe phase, we gather lots of data, like Tycho Brahe patiently recording the positions of the planets night after night, year after year. In the Kepler phase, we fit empirical laws to the data, like Kepler did to the planets’ motions. In the Newton phase, we discover the deeper truths. Most science consists of Brahe-and-Kepler-like work; Newton moments are rare. Today, big data does the work of billions of Brahes, and machine learning the work of millions of Keplers. If-let’s hope so-there are more Newton moments to be had, they are as likely to come from tomorrow’s learning algorithms as from tomorrow’s even more overwhelmed scientists, or at least from a combination of the two.

Critics of Master Algo : Well, for a concept as ambitious as Master Algo, there are bound to be critics and there are many. The author mentions a few of them as examples,

  • Knowledge engineers
  • Marvin Minsky
  • Noam Chomsky
  • Jerry Fodor

Hedgehog or Fox : One of the other questions that comes up when we think of the "Master Algorithm" is whether it is a fox or a hedgehog. There are many studies that have shown that being a fox is far better than being a hedgehog. The hedgehog is synonymous with an "expert". In the context of this book though, a learning algorithm can be considered a "hedgehog" if variations of it can solve all the learning problems. The author hopes that the "Master Algorithm" turns out to be a hedgehog.

Five tribes of ML : In the quest for the master algorithm, we do not have to start from scratch. There are already many decades of ML research to build on, and each research community is akin to a tribe. The author describes five tribes, i.e. symbolists, connectionists, Bayesians, evolutionaries and analogizers. Each of the tribes uses its own master algo. Here is an illustration from the author’s presentation



But the real master algo that the author is hinting at is an algo that combines all the features of the tribes.

The most important feature of each tribe is that its members firmly believe that theirs is the only way to model and predict. Unfortunately, this thinking hinders their ability to model a broad set of problems. For example, a Bayesian would find it extremely difficult to leave the probabilistic inference method and look at the problem from an evolutionary point of view. His thinking is forged by priors, posteriors and likelihood functions. If a Bayesian were to look at an evolutionary algorithm like a genetic algo, he might not critically analyze it and adapt it to the problem at hand. This limitation is prevalent across all tribes. Analogizers love support vector machines, but they are limited because they look for similarities of inputs across various dimensions, i.e. they are bound to be hit by the curse of dimensionality. The same serpent, "the curse of dimensionality", that the author talks about in the previous chapters comes and bites each tribe, depending on the type of problem being solved.

The obvious question that arises in a reader’s mind is: can a combination of tribes come together to solve a specific set of problems? Indeed, the tribe categorization is not a hard categorization of the algorithms. It is just meant as a starting point so that you can place the gamut of algos in separate buckets.


Hume’s problem of Induction

The chapter starts with a discussion of "Rationalism vs. Empiricism". The rationalist likes to plan everything in advance before making the first move. The empiricist prefers to try things and see how they turn out. There are philosophers who strongly believe in one and not in the other. From a practical standpoint, there have been productive contributions to our world from both camps. David Hume is considered to be one of the greatest empiricists of all time. In the context of Machine Learning, one of his questions has hung like a sword of Damocles over all knowledge:

How can we ever be justified in generalizing from what we’ve seen to what we haven’t?

The author uses a simple example where you have to decide whether or not to ask someone out on a date. The dataset used in the example illustrates Hume’s problem of induction, i.e. there is no reason to pick one generalization over another. So, a safe way out of the problem, at least to begin with, is to assume that the future will be like the past. Is this enough? Not really. In the ML context, the real problem is how to generalize to cases that we haven’t seen before. One might think that by amassing huge datasets, you can solve this problem. However, once you do the math, you realize that you will never have enough data to cover all the cases needed to carry the inductive argument safely. Each new data point is most likely unique and you have no choice but to generalize. According to Hume, there is no way to do it.

If this all sounds a bit abstract, suppose you’re a major e-mail provider, and you need to label each incoming e-mail as spam or not spam. You may have a database of a trillion past e-mails, each already labeled as spam or not, but that won’t save you, since the chances that every new e-mail will be an exact copy of a previous one are just about zero. You have no choice but to try to figure out at a more general level what distinguishes spam from non-spam. And, according to Hume, there’s no way to do that.

The "no free lunch" theorem : If you have been reading some general articles in the media on ML and big data, it is likely that you would have come across a view on the following lines:

With enough data, ML can churn out the best learning algo. You don’t have to have strong priors, the fact that you have large data is going to give you all the power to understand and model the world.

The author introduces David Wolpert’s "no free lunch" theorem, which sets a limit on how good a learning algorithm can be. The theorem says that, averaged over all possible worlds, no learner can be better than random guessing. Are you surprised by this theorem? Here is how one can reconcile oneself to it:

Pick your favorite learner. For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount. All I have to do is flip the labels of all unseen instances. Since the labels of the observed ones agree, there’s no way your learner can distinguish between the world and the antiworld. On average over the two, it’s as good as random guessing. And therefore, on average over all possible worlds, pairing each world with its antiworld, your learner is equivalent to flipping coins.
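The world and antiworld pairing can be played out in a few lines. This is only a sketch of the argument on made-up data: the two learners (`majority_learner`, `parity_learner`) and the labels are hypothetical, chosen by me to show that the average comes out the same whatever the learner does.

```python
def accuracy(predict, examples):
    """Fraction of examples whose label the predictor gets right."""
    hits = sum(1 for x, label in examples if predict(x) == label)
    return hits / len(examples)


def majority_learner(train):
    """A hypothetical learner: always predict the most common training label."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return lambda x: guess


def parity_learner(train):
    """Another hypothetical learner: predict 1 for even inputs, 0 for odd ones."""
    return lambda x: 1 if x % 2 == 0 else 0


train = [(0, 1), (1, 1), (2, 0)]               # observed instances, same in both worlds
world = [(3, 1), (4, 1), (5, 0), (6, 1)]       # unseen instances in the world
antiworld = [(x, 1 - label) for x, label in world]  # same instances, labels flipped

averages = {}
for learner in (majority_learner, parity_learner):
    predict = learner(train)
    in_world = accuracy(predict, world)
    in_antiworld = accuracy(predict, antiworld)
    averages[learner.__name__] = (in_world + in_antiworld) / 2
    print(learner.__name__, in_world, in_antiworld, averages[learner.__name__])
```

Every unseen instance a learner gets right in the world it gets wrong in the antiworld, so the average over the pair is a coin flip no matter how clever the learner is.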

How do we escape the random guessing limit? Just care about the world we live in and don’t care about alternate worlds. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. What are the implications of the "no free lunch" theorem in our modeling world?

There’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.

Unwritten rule of Machine learning : The author states that a principle laid out by Newton in his work, "Principia", serves as the first unwritten rule of ML:

Whatever is true of everything we’ve seen is true of everything in the universe.

Newton’s principle is only the first step, however. We still need to figure out what is true of everything we’ve seen, that is, how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner’s job is to flesh it out. One of the ways to think about creating a form is via "conjunctive concepts", i.e. a series of statements with AND as the bridge. The problem with "conjunctive concepts" is that they are practically useless. The real world is driven by "disjunctive concepts", i.e. concepts defined by a set of rules. One of the pioneers in this approach of discovering rules was Ryszard Michalski, a Polish computer scientist. After immigrating to the United States in 1970, he went on to found the symbolist school of machine learning, along with Tom Mitchell and Jaime Carbonell.

Overfitting and Underfitting : The author uses the words "blindness" and "hallucination" to describe underfitting and overfitting models. By using a ton of hypotheses, you can almost certainly overfit the data. On the other hand, by being sparse in your hypothesis set, you can fail to see the true patterns in the data. This classic problem is addressed by doing out-of-sample testing. Is it good enough? Well, that’s the best that is available without going into the muddy philosophical debates or alternative pessimistic approaches like that of Leslie Valiant (author of Probably Approximately Correct).
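Here is a small sketch of why out-of-sample testing catches "hallucination". Everything in it is an assumption of mine for illustration: the synthetic data (label is 1 when x > 0.5, with 20% of labels flipped), the `memorizer` that copies the nearest training label, and the `threshold_learner` that only picks a cut-off from a coarse grid.

```python
import random


def make_data(n, rng, noise=0.2):
    """Label is 1 when x > 0.5, with a fraction of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def memorizer(train):
    """Flexible learner: copy the label of the closest training point."""
    def predict(x):
        nearest = min(train, key=lambda example: abs(example[0] - x))
        return nearest[1]
    return predict


def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)


def threshold_learner(train):
    """Restricted learner: pick the best cut-off from a coarse grid."""
    candidates = [i / 10 for i in range(1, 10)]
    best = max(candidates, key=lambda t: accuracy(lambda x: int(x > t), train))
    return lambda x: int(x > best)


rng = random.Random(42)
train = make_data(300, rng)
test = make_data(1000, rng)

results = {}
for name, learner in (("memorizer", memorizer), ("threshold", threshold_learner)):
    predict = learner(train)
    results[name] = (accuracy(predict, train), accuracy(predict, test))
    print(name, "train:", results[name][0], "test:", results[name][1])
```

The memorizer looks perfect on the data it has seen because it has also memorized the noise; only the held-out test set reveals that the simpler rule generalizes better.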

Induction as inverse of deduction : Symbolists work via the induction route and formulate an elaborate set of rules. Since this route is computationally intensive for large datasets, the symbolists prefer something like decision trees. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. How do we then decide which concept the instance belongs to?

Decision trees are used in many different fields. In machine learning, they grew out of work in psychology. Earl Hunt and colleagues used them in the 1960s to model how humans acquire new concepts, and one of Hunt’s graduate students, J. Ross Quinlan, later tried using them for chess. His original goal was to predict the outcome of king-rook versus king-knight endgames from the board positions. From those humble beginnings, decision trees have grown to be, according to surveys, the most widely used machine-learning algorithm. It’s not hard to see why: they’re easy to understand, fast to learn, and usually quite accurate without too much tweaking. Quinlan is the most prominent researcher in the symbolist school. An unflappable, down-to-earth Australian, he made decision trees the gold standard in classification by dint of relentlessly improving them year after year, and writing beautifully clear papers about them. Whatever you want to predict, there’s a good chance someone has used a decision tree for it.
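The heart of a decision tree learner is choosing which attribute to test first. The sketch below shows one common criterion, information gain, which is what Quinlan's ID3 uses; the post itself does not go into this, and the tiny spam dataset with its attribute names (`has_link`, `all_caps`) is made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Uncertainty of a list of labels, in bits."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Drop in entropy obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [label for row, label in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Pick the attribute whose test tells us the most about the label."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Hypothetical e-mails described by two yes/no attributes.
rows = [
    {"has_link": True, "all_caps": True},
    {"has_link": True, "all_caps": False},
    {"has_link": False, "all_caps": True},
    {"has_link": False, "all_caps": False},
]
labels = ["spam", "spam", "ham", "ham"]

print(best_split(rows, labels))
print(information_gain(rows, labels, "all_caps"))
```

A full tree learner would apply `best_split` recursively to each branch until the leaves are pure enough; because every instance follows exactly one path, the question of which rule wins never arises.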

The Symbolists : The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate.

Symbolist machine learning is an offshoot of the knowledge engineering school of AI. The use of computers to automatically learn the rules made the work of pioneers like Ryszard Michalski, Tom Mitchell, and Ross Quinlan extremely popular, and since then the field has exploded.

What are the shortcomings of inverse deduction?

  • The number of possible inductions is vast, and unless we stay close to our initial knowledge, it’s easy to get lost in space
  • Inverse deduction is easily confused by noise
  • Real concepts can seldom be concisely defined by a set of rules. They’re not black and white: there’s a large gray area between, say, spam and nonspam. They require weighing and accumulating weak evidence until a clear picture emerges. Diagnosing an illness involves giving more weight to some symptoms than others, and being OK with incomplete evidence. No one has ever succeeded in learning a set of rules that will recognize a cat by looking at the pixels in an image, and probably no one ever will.

An interesting example of a success from Symbolists is Eve, the robot scientist that discovered a potential malaria drug. There was a flurry of excitement a year ago when an article titled "Robot Scientist Discovers Potential Malaria Drug" was published in Scientific American. This is the kind of learning that Symbolists are gung-ho about.


How does your brain learn ?

This chapter covers the second of the five tribes mentioned in the book. This tribe is called "Connectionists". Connectionists are highly critical of the way Symbolists work, as they think that describing something via a set of rules is just the tip of the iceberg. There is a lot more going on under the surface that formal reasoning can’t see. Let’s say you come across the word "love". Symbolists would associate a rule with such a concept, whereas Connectionists would associate various parts of the brain with it. In a sense, there is no one-to-one correspondence between a concept and a symbol. Instead the correspondence is many to many. Each concept is represented by many neurons, and each neuron participates in representing many different concepts. Hebb’s rule is the cornerstone of connectionism. In a non-math way, it says that "Neurons that fire together wire together". The other big difference between Symbolists and Connectionists is that the former tribe believes in sequential processing whereas the latter tribe believes in parallel processing.

To get some basic understanding of the key algos used by connectionists, it is better to have a bit of understanding of the way a neuron is structured in our brain. Here is a visual that I picked up from the author’s presentation :


The branches of the neuron connect to others via synapses and basic learning takes place via synaptic connections. The first formal model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943. It looked a lot like the logic gates computers are made of. The problem with this model was that the model did not learn. It was Frank Rosenblatt who came up with the first model of learning by giving variable weights to the connections between neurons. The following is a good schematic diagram of the perceptron:


This model generated a lot of excitement and ML received a lot of funding for various research projects. However, this excitement was short-lived. Marvin Minsky and Seymour Papert, in their 1969 book Perceptrons, published many examples where the perceptron failed to learn. One of the simplest and most damaging examples that the perceptron could not learn was the XOR operator. Perceptrons was mathematically unimpeachable, searing in its clarity, and disastrous in its effects. Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.
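The XOR failure is easy to reproduce. Below is a minimal sketch of a single perceptron with a step activation and the classic error-driven weight update; the function names, learning rate and epoch count are my own choices for illustration.

```python
def train_perceptron(samples, epochs=25, learning_rate=1.0):
    """Single neuron with a step activation, trained by the perceptron rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = target - output
            # Nudge each weight in the direction that reduces the error.
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias


def count_errors(samples, weights, bias):
    """Number of samples the trained neuron still gets wrong."""
    mistakes = 0
    for (x1, x2), target in samples:
        output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
        mistakes += int(output != target)
    return mistakes


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_errors = count_errors(AND, *train_perceptron(AND))
xor_errors = count_errors(XOR, *train_perceptron(XOR))
print("AND mistakes:", and_errors)
print("XOR mistakes:", xor_errors)
```

AND can be separated by a single straight line, so the perceptron settles on it within a few passes. No straight line separates the XOR cases, so the weights keep cycling and at least one case is always wrong, however long you train.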

Fast forward to John Hopfield’s work on spin glasses, which brought about a reincarnation of the perceptron:

Hopfield noticed an interesting similarity between spin glasses and neural networks: an electron’s spin responds to the behavior of its neighbors much like a neuron does. In the electron’s case, it flips up if the weighted sum of the neighbors exceeds a threshold and flips (or stays) down otherwise. Inspired by this, he defined a type of neural network that evolves over time in the same way that a spin glass does and postulated that the network’s minimum energy states are its memories. Each such state has a "basin of attraction" of initial states that converge to it, and in this way the network can do pattern recognition: for example, if one of the memories is the pattern of black-and-white pixels formed by the digit nine and the network sees a distorted nine, it will converge to the "ideal" one and thereby recognize it. Suddenly, a vast body of physical theory was applicable to machine learning, and a flood of statistical physicists poured into the field, helping it break out of the local minimum it had been stuck in.
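A toy version of such a network fits in a few lines. The sketch below is my own illustration under simple assumptions: patterns are lists of +1/-1 "pixels", the weights are stored with a Hebbian rule (units that agree get a stronger link), and the two eight-pixel memories are made up.

```python
def train_hopfield(patterns):
    """Hebbian storage: strengthen the link between units that agree in a pattern."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights


def recall(weights, state, sweeps=5):
    """Update units one at a time for a few sweeps so the state can settle."""
    state = list(state)
    for _ in range(sweeps):
        for i, row in enumerate(weights):
            weighted_sum = sum(w * s for w, s in zip(row, state))
            state[i] = 1 if weighted_sum >= 0 else -1
    return state


# Two hypothetical 8-"pixel" memories, coded as +1 (black) and -1 (white).
memory_a = [1, 1, 1, 1, -1, -1, -1, -1]
memory_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([memory_a, memory_b])

distorted = list(memory_a)
distorted[0] = -distorted[0]  # flip one pixel
print(recall(weights, distorted) == memory_a)  # True
```

The distorted pattern starts inside the basin of attraction of `memory_a`, so flipping each unit toward the weighted vote of its neighbors pulls it back to the stored memory.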

The author goes on to describe the "Sigmoid" function and its ubiquitous nature. If you think about the curve for some time, you will find it everywhere. I think the first time I came across this function was in Charles Handy’s book, "The Age of Paradox". Sigmoid functions in that book are used to describe various types of phenomena that show a slow rate of increase in the beginning, then a sudden explosive rate of increase, and subsequently a tapering off. Basically, if you take the first derivative of the Sigmoid function, you get a bell-shaped curve. I think the book "The Age of Paradox" had a chapter with some heavy management gyan that went something like "you need to create another Sigmoid curve in your life before the older Sigmoid curve starts a downfall", or something to that effect. I don’t quite recollect the exact idea from Charles Handy’s book, but there is a blog post by Bret Simmons, titled The Road to Davy’s Bar, that goes into related details.
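The S-curve and its bell-shaped slope are quick to verify numerically. The sketch below assumes the standard logistic form of the sigmoid, 1 / (1 + e^(-x)), whose derivative is s(x) * (1 - s(x)).

```python
from math import exp


def sigmoid(x):
    """Logistic S-curve: slow start, steep middle, slow finish."""
    return 1.0 / (1.0 + exp(-x))


def sigmoid_slope(x):
    """First derivative of the logistic curve: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-6, -3, -1, 0, 1, 3, 6):
    print(x, round(sigmoid(x), 4), round(sigmoid_slope(x), 4))
```

The slope peaks at 0.25 in the middle and falls away symmetrically on both sides, which is the bell shape. Unlike the perceptron's step function, this curve has a usable slope everywhere, and that is what makes gradient-based learning possible.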

Well, in the context of ML, the application of the Sigmoid curve is more practical. It can be used to replace the step function, and suddenly things become more tractable. A single neuron can learn a straight line, but a set of neurons, i.e. a multi-layer perceptron, can learn more convoluted curves. Agreed, there is a curse of dimensionality here, but if you think about it, the hyperspace explosion is a double-edged sword. On the one hand, the objective function is far more wiggly, but on the other hand, there is less chance that you will get stuck at a local minimum via gradient search methods. With this Sigmoid input and multi-layer tweak, the perceptron came back with a vengeance. There was a ton of excitement, just like the time when the perceptron was introduced. The algorithm by which the learning takes place is called "back propagation", so named because the error signal is propagated backward through the network. This algo was invented in 1986 by David Rumelhart, with the help of Geoff Hinton and Ronald Williams. It is a variant of the gradient descent method. There is no mathematical proof that back propagation will find the global minimum, though. Backprop solves what the author calls the "credit assignment" problem. In a multi-layered perceptron, the error between the target value and the current value needs to be propagated backward across all layers. The basic idea of error propagation, i.e. assigning a share of the error to each of the layers, is done via backprop.

Whenever the learner’s "retina" sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two.
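The forward and backward passes described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: a tiny network with two inputs, two hidden neurons and one output, a squared-error loss, and made-up starting weights. None of these specifics come from the book.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: inputs -> two hidden neurons -> one output neuron."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output

def loss(params, x, target):
    """Squared error between the network's output and the desired output."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2

def backprop(params, x, target):
    """Backward pass: gradient of the loss for every weight and bias."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    # Error signal at the output neuron (chain rule through its sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    # Send the error signal back to each hidden neuron: its share of the blame.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out

def gradient_step(params, x, target, rate=0.1):
    """One gradient descent update using the backprop gradients."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = backprop(params, x, target)
    return (
        [[w - rate * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)],
        [b - rate * g for b, g in zip(b1, gb1)],
        [w - rate * g for w, g in zip(W2, gW2)],
        b2 - rate * gb2,
    )

# Hypothetical starting weights and a single training example.
params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
x, target = [1.0, 0.0], 1.0
print("loss before:", loss(params, x, target))
print("loss after one step:", loss(gradient_step(params, x, target), x, target))
```

Each weight is nudged in proportion to its share of the error, which is the "credit assignment" idea in miniature.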

Sadly the excitement petered out, as learning with dozens or hundreds of hidden layers was computationally difficult. In recent years though, backpropagation has made a comeback thanks to huge computing power and big data. It now goes by the name "Deep Learning". The key idea of deep learning is based on autoencoders, which the author explains very well. However there are many things that need to be worked out for deep learning to be anywhere close to the Master Algorithm. All said and done, there are a few limitations to exclusively following the connectionist tribe. Firstly, what the network has learned is difficult to comprehend, as it comprises convoluted connections between various neurons. The other limitation is that the approach is not compositional, meaning it is divorced from the way a big part of human cognition works.


Evolution : Nature’s Learning Algorithm

The chapter starts with the story of John Holland, who in 1959 became the first person to earn a PhD in computer science. Holland is known for his immense contribution to genetic algorithms. His key insight lay in coming up with a fitness function that would assign a score to every program considered. What’s the role of the fitness function ? Starting with a population of not-very-fit individuals, possibly completely random ones, the genetic algorithm has to come up with variations that can then be selected according to fitness. How does nature do that? This is where the genetic part of the algorithm comes in. In the same way that DNA encodes an organism as a sequence of base pairs, we can encode a program as a string of bits. Variations are produced by crossovers and mutations. The next breakthrough, genetic programming, came from Holland’s student John Koza, who came up with the idea of evolving full-blown computer programs.
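Here is a minimal sketch of the loop described above: bit strings, a fitness function, selection, crossover and mutation. The fitness function (count the 1-bits), the population size, the mutation rate and the elitism step are all my own toy assumptions, chosen only to keep the example short.

```python
import random

def fitness(bits):
    """Toy fitness (an assumption for illustration): the number of 1-bits."""
    return sum(bits)

def crossover(mom, dad, rng):
    """Splice the head of one parent onto the tail of the other."""
    point = rng.randrange(1, len(mom))
    return mom[:point] + dad[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    """Return the best fitness seen at the start of each generation, plus the final one."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: the fitter half breeds
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [population[0]] + children  # elitism: the best survives unchanged
    best_history.append(max(fitness(p) for p in population))
    return best_history

if __name__ == "__main__":
    print(evolve())
```

Because the fittest individual is always carried over unchanged, the best score can never go down from one generation to the next.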

Genetic programming’s first success, in 1995, was in designing electronic circuits. Starting with a pile of electronic components such as transistors, resistors, and capacitors, Koza’s system reinvented a previously patented design for a low-pass filter, a circuit that can be used for things like enhancing the bass on a dance-music track. Since then he’s made a sport of reinventing patented devices, turning them out by the dozen. The next milestone came in 2005, when the US Patent and Trademark Office awarded a patent to a genetically designed factory optimization system. If the Turing test had been to fool a patent examiner instead of a conversationalist, then January 25, 2005, would have been a date for the history books. Koza’s confidence stands out even in a field not known for its shrinking violets. He sees genetic programming as an invention machine, a silicon Edison for the twenty-first century.

A great mystery in genetic programming that is yet to be solved conclusively is the role of crossover. None of Holland’s theoretical results show that crossover actually helps; mutation suffices to exponentially increase the frequency of the fittest schemas in the population over time. There were other problems with genetic programming that finally made the ML community at large divorce itself from this tribe.

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy. As in the nature versus nurture debate, neither side has the whole answer; the key is figuring out how to combine the two. The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. So, is this it ? Have we stumbled on to the right path for "Master Algorithm" ? Not quite. There are tons of problems with evolutionary algos. Symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do. If we want to learn to diagnose cancer, for example, it’s not enough to say "this is how nature learns; let’s do the same." There’s too much at stake. Errors cost lives. Symbolists dominated the first few decades of cognitive psychology. In the 1980s and 1990s, connectionists held sway, but now Bayesians are on the rise.


In the Church of the Reverend Bayes

Persi Diaconis, in his paper titled "The Markov Chain Monte Carlo Revolution", says that the MCMC technique that came from the Bayesian tribe has revolutionized applied mathematics. Indeed, thanks to high performance computing ability, Bayes is now a standard tool in any number cruncher’s tool kit. This chapter talks about various types of Bayesian techniques. The basic idea behind Bayes is that it is a systematic and quantified way of updating degrees of belief in the light of new data. You can pretty much cast any problem, irrespective of the size of the data available, into a Bayesian inference problem. Bayes’ theorem usually goes by the name "inverse probability" because in real life we know Pr(effect|cause) and we are looking to compute Pr(cause|effect). Bayes’ theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy. The main point of conflict between Bayesians and Non-Bayesians is the reliance on subjective priors to "turn on the Bayesian crank". Using subjective estimates as probabilities is considered a sin by Frequentists, for whom everything should be learned from the data.
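The "inverse probability" idea fits in a few lines of Python. The numbers below are made up for illustration (a cause with a 1% prior, an effect that shows up 90% of the time when the cause is present and 5% of the time when it is absent); they are not from the book.

```python
def posterior(prior, p_effect_given_cause, p_effect_given_no_cause):
    """Bayes' theorem: turn Pr(effect|cause) into Pr(cause|effect)."""
    evidence = (p_effect_given_cause * prior
                + p_effect_given_no_cause * (1.0 - prior))
    return p_effect_given_cause * prior / evidence

# Hypothetical numbers: a rare cause and a fairly reliable effect.
p = posterior(prior=0.01, p_effect_given_cause=0.9, p_effect_given_no_cause=0.05)
print(f"Pr(cause|effect) = {p:.3f}")
```

Even with a fairly reliable effect, the posterior stays at roughly 15%, because the low prior still dominates. This is exactly why the choice of prior is so contentious.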

One of the most common variants of Bayesian models is the "Naive Bayes" model, where all the effects are assumed to be independent of each other, given the cause. Even though this assumption sounds extremely crazy, there are a ton of areas where Naive Bayes beats sophisticated models. No one is sure who invented the Naïve Bayes algorithm. It was mentioned without attribution in a 1973 pattern recognition textbook, but it only took off in the 1990s, when researchers noticed that, surprisingly, it was often more accurate than much more sophisticated learners. Also, if you reflect a bit, you will realize that Naive Bayes is closely related to the Perceptron algorithm.
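A minimal sketch of a Naive Bayes text classifier, to show how little machinery the independence assumption needs. The tiny "spam"/"ham" dataset, the function names and the add-one smoothing are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (list_of_words, label). Collect the counts we need."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(model, words):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one smoothing so an unseen word does not zero out a label.
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for w in words:
            # The "naive" step: each word contributes independently, given the label.
            score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data.
examples = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]
model = train_naive_bayes(examples)
print(predict(model, ["free", "money", "now"]))
print(predict(model, ["lunch", "meeting"]))
```

The score is a weighted sum of per-word contributions, which is one way to see the kinship with the Perceptron.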

The author mentions Markov Models as the next step in the evolution of Bayes models. Markov models are applicable to a sequence of random variables where the future is conditionally independent of the past, given the current state. Markov chains turn up everywhere and are one of the most intensively studied topics in mathematics, but they’re still a very limited kind of probabilistic model. A more complicated model is the Hidden Markov Model, where we don’t get to see the actual states but have to infer them from the observations. A continuous version of the HMM goes under the name "Kalman Filter" and has been used in many applications across domains.
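As a small illustration of the Markov property, here is a two-state chain in Python. The weather states and transition probabilities are hypothetical; the only thing the sketch relies on is that the next state depends on the current state alone.

```python
def step(distribution, transitions):
    """Push a probability distribution over states one step forward in time."""
    nxt = {state: 0.0 for state in transitions}
    for state, prob in distribution.items():
        for target, p in transitions[state].items():
            nxt[target] += prob * p
    return nxt

# Hypothetical transition probabilities: tomorrow depends only on today.
transitions = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

dist = {"sunny": 0.0, "rainy": 1.0}  # start on a rainy day
for _ in range(50):
    dist = step(dist, transitions)
print(dist)
```

After enough steps the starting point is forgotten and the distribution settles at 5/6 sunny and 1/6 rainy, the stationary distribution of this particular chain.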

Naive Bayes, Markov Models and Hidden Markov Models are all good, but they are a far cry from the expressiveness that Symbolists have. The next breakthrough came from Judea Pearl, who invented Bayesian Networks. This allowed one to specify complex dependencies among random variables. By defining the conditional independence of a variable given a set of neighboring nodes, Bayesian networks tame the combinatorial explosion and make inferences tractable. Basically, a Bayesian network can be thought of as a "generative model", a recipe for probabilistically generating a state of the world. Despite the complex nature of a Bayesian net, the author mentions that there have been techniques developed to successfully infer various aspects of the network. In this context, the author mentions MCMC and gives an intuitive explanation of the technique. A misconception amongst many is that MCMC is a simulation technique. Far from it, the procedure does not simulate any real process; rather it is an efficient way to generate samples from a Bayesian network. Inference in Bayesian networks is not limited to computing probabilities. It also includes finding the most probable explanation for the evidence. The author uses the "poster child" example of inferring the probability of heads from coin tosses to illustrate the Bayesian technique and compare it with the Frequentist world of inference.
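The coin-toss comparison can be sketched as follows. I am assuming a Beta prior on the probability of heads (a uniform Beta(1, 1) by default), which is the standard textbook choice and not necessarily the exact setup the author uses.

```python
def frequentist_estimate(heads, tosses):
    """Maximum likelihood: the observed fraction of heads."""
    return heads / tosses

def bayesian_estimate(heads, tosses, prior_heads=1.0, prior_tails=1.0):
    """Posterior mean under a Beta(prior_heads, prior_tails) prior (an assumption)."""
    return (heads + prior_heads) / (tosses + prior_heads + prior_tails)

# Three heads in three tosses.
print(frequentist_estimate(3, 3))  # 1.0: the coin "always" lands heads
print(bayesian_estimate(3, 3))     # 0.8: the prior tempers the conclusion
```

With only three tosses the two answers differ a lot; as the number of tosses grows, the prior washes out and the two estimates converge.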

The next set of models that came to dominate the Bayesian tribe is Markov Networks. A Markov network is a set of features and corresponding weights, which together define a probability distribution. Like Bayesian networks, Markov networks can be represented by graphs, but they have undirected arcs instead of arrows. Markov networks are a staple in many areas, such as computer vision. There are many who feel that Markov networks are far better than Naive Bayes, HMMs etc., as they can capture the influence from surroundings.

Bayesians and symbolists agree that prior assumptions are inevitable, but they differ in the kinds of prior knowledge they allow. For Bayesians, knowledge goes in the prior distribution over the structure and parameters of the model. In principle, the parameter prior could be anything we please, but ironically, Bayesians tend to choose uninformative priors (like assigning the same probability to all hypotheses) because they’re easier to compute with. For structure, Bayesian networks provide an intuitive way to incorporate knowledge: draw an arrow from A to B if you think that A directly causes B. But symbolists are much more flexible: you can provide as prior knowledge to your learner anything you can encode in logic, and practically anything can be encoded in logic, provided it’s black and white.

Clearly, we need both logic and probability. Curing cancer is a good example. A Bayesian network can model a single aspect of how cells function, like gene regulation or protein folding, but only logic can put all the pieces together into a coherent picture. On the other hand, logic can’t deal with incomplete or noisy information, which is pervasive in experimental biology, but Bayesian networks can handle it with aplomb. Combining connectionism and evolutionism was fairly easy: just evolve the network structure and learn the parameters by backpropagation. But unifying logic and probability is a much harder problem.


You are what you resemble

The author introduces techniques of the "Analogizers" tribe. This tribe uses similarities among various data points to categorize them into distinct classes. In some sense, we all learn by analogy. Every example that illustrates an abstract concept is like an analogy. We learn by relating the similarity between two concepts and then figuring out what else one can infer based on the fact that the two concepts are similar.

The chapter begins with explaining the most popular algorithm of the tribe, "the nearest neighbor algorithm". This was invented way back in 1951 by Evelyn Fix and Joe Hodges. The inventors faced a massive difficulty in publishing their algorithm. However the fact that the algo remained unpublished did not faze many researchers who went about developing variants of the algorithm like "K nearest neighbor" method, "Weighted K-nearest neighbor" etc. It was in 1967 that Tom Cover and Peter Hart proved that, given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier. This was a momentous revelation. Up until then, all known classifiers assumed that the frontier had a very specific form, typically a straight line. This was a double-edged sword: on the one hand, it made proofs of correctness possible, as in the case of the perceptron, but it also meant that the classifier was strictly limited in what it could learn. Nearest-neighbor was the first algorithm in history that could take advantage of unlimited amounts of data to learn arbitrarily complex concepts. No human being could hope to trace the frontiers it forms in hyperspace from millions of examples, but because of Cover and Hart’s proof, we know that they’re probably not far off the mark.
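The nearest-neighbor idea needs almost no code, which is part of its charm. Below is a minimal k-nearest-neighbor sketch; the Euclidean distance, the majority vote and the toy points are my own choices for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """train: list of (point, label). Vote among the k points closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points in two dimensions.
train = [
    ((0, 0), "a"), ((0, 1), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]
print(knn_predict(train, (1, 0), k=1))
print(knn_predict(train, (5, 4), k=3))
```

There is no training phase at all: the "model" is the data itself, and all the work happens at prediction time.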

Is the nearest neighbor algo the master algorithm ? It isn’t, because of the curse of dimensionality. As the dimension of the covariates goes up, the efficiency of the NN algo goes down. In fact the curse of dimensionality is the second most important stumbling block in machine learning, over-fitting being the first one. There are certain techniques to handle the dimension explosion but most of them are hacks and there is no guarantee that they are going to work.

Subsequently, the author introduces Support Vector Machines (SVMs), which have become the most popular technique used by Analogizers. I loved the way the author describes this technique using plain simple English. He asks the reader to visualize a fat serpent that moves between two countries that are at war. The story of finding the serpent incorporates pretty much all the math that is needed to compute support vectors, i.e.

  • kernel for SVM
  • support vectors
  • weight of the support vectors
  • constrained optimization
  • maximizing the margin of the classifier

My guess is, one would understand the math far more easily after reading through this section on SVMs. SVMs have many advantages and the author highlights most of them. Books such as these also help us in verbalizing math stuff in simple words. For example, if you were to explain the difference between constrained optimization and unconstrained optimization to a taxi driver, how would you do it? Read this book to check whether your explanation is better than what the author provides.

Towards the end of the chapter, the author talks about case-based reasoning and says that in the years to come, analogical reasoning will become so powerful that it will sweep through all the fields where case-based reasoning is still employed.


Learning without a teacher

Unlike the previous chapters that focused on labeled data, this chapter is about unsupervised learning. Cognitive scientists describe theories of child learning using algos, and machine learning researchers have developed techniques based on them. The author explains the k-means algorithm, a popular clustering technique. It is actually a special case of the Expectation Maximization (EM) algorithm, which was invented by three Harvard statisticians. EM is used in a ton of places. To learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. Once you have clusters at the macro level, nothing stops you from applying the same algo within each cluster to come up with sub-clusters, and so on.
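The alternation at the heart of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched as below. The starting centroids are passed in by hand and the toy points are made up; a real implementation would pick starting centroids more carefully.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and re-estimating centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two obvious blobs.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

The two steps mirror the two halves of EM: guess the missing labels, then re-estimate the model given those guesses.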

Subsequently, the author introduces another popular technique for unsupervised learning, PCA, which is used for dimensionality reduction. PCA tries to come up with linear combinations of the original dimensions such that the first few of them capture as much of the variance in the data as possible. A step up from this algo is "Isomap", a nonlinear dimensionality reduction technique. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network and finds the reduced coordinates that best approximate these distances.
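To see what "the direction of maximum variance" means, here is a small sketch that finds the first principal component with plain power iteration on the covariance matrix. The choice of power iteration and the toy data are mine; libraries typically use an eigen or singular value decomposition instead.

```python
import math

def first_principal_component(points, iterations=100):
    """Unit vector along which the centered data varies the most."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    # Assumes the data is not constant and the start vector is not orthogonal
    # to the leading direction.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data lying exactly on the line y = 2x.
points = [(x, 2 * x) for x in (-2, -1, 0, 1, 2)]
print(first_principal_component(points))
```

For data lying on the line y = 2x, the component comes out proportional to (1, 2), so one coordinate along that direction describes the data completely.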

After introducing clustering and dimensionality reduction techniques, the author talks about "Reinforcement learning", a technique that relies on the immediate response of the environment to various actions of the learner. Research on reinforcement learning started in earnest in the early 1980s, with the work of Rich Sutton and Andy Barto at the University of Massachusetts. They felt that learning depends crucially on interacting with the environment, but supervised algorithms didn’t capture this, and they found inspiration instead in the psychology of animal learning. Sutton went on to become the leading proponent of reinforcement learning. Another key step happened in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children’s learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup that combines neural networks and reinforcement learning is "DeepMind", a company that was acquired by Google for half a billion dollars.

Another algorithm that has a potential to be a part of "Master Algorithm" is chunking. Chunking remains a preeminent example of a learning algorithm inspired by psychology. The author gives a basic outline of this concept. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction, but a simpler type of learning by interacting with the environment is: A/B testing. The chapter ends with the author explaining another potentially killer algo, "relational learning".


The Pieces of the Puzzle Fall into Place

Progress in science comes from unifying theories; two or more seemingly disparate observations are driven by the same logic or law. If one looks at ML, it appears that the Master Algorithm is akin to a unifying theory in science. It will unify the master algorithms and techniques of all the tribes and give one cohesive way to learn from data.

In fact there is already a technique called "meta learning" that some of the tribes use within their techniques. For example, bagging, random forests and boosting are some of the famous meta learning techniques used by Symbolists. Bayesians have something called "model averaging" that treats each model as a hypothesis and then computes a score based on the vote given by each of the models. Meta learning in its current avatar is remarkably successful, but it’s not a very deep way to combine models. It’s also expensive, requiring as it does many runs of learning, and the combined models can be quite opaque.

The author used the following schematic diagram for each of the tribes, while explaining the rationale of a possible “Master Algorithm”


He then takes the reader through a tour of each tribe’s philosophy and master algorithm and comes up with a unifier, called "Alchemy", which he presents as the "Master Algorithm". In the process of creating this master algorithm, he introduces Markov Logic Networks, which serve as the representation of the problem. Alchemy uses posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. The author is wary about Alchemy’s immediate application and says that there is a ton of research yet to be done before it can become a true Master Algorithm, i.e., one that has the capability to solve hard problems.

This chapter is a little more involved, as it tries to connect all the ideas from the previous eight chapters and introduces a way to combine the various pieces of the puzzle to create a "Master Algorithm". The chapter will also be very interesting for an aspiring ML researcher who is trying to pick a focus area.


This is the World on Machine Learning

The last chapter of the book discusses the world in which "Master Algorithm" is all pervasive. The author tries to speculate answers to the following questions :

  • Will humans be replaced by machines ?
  • What do you want the learners of the world to know about you ?
  • How good a model of you can a learner have ?
  • What will the ecommerce shopping experience be like ?
  • Will there be a rise of "humanities" disciplines after the automation of most of the non-human related tasks ?
  • What will the current online dating sites morph in to ?
  • How will the Amazons, Netflixes, Googles of the world change ?
  • What will be the privacy issues in a society where most of the transactions and activities involve one algo talking to another algo ?
  • Will future wars be fought by robots ?
  • Will robot-warfare be viable ?
  • Will AI and Master Algorithm take over the world ?

The author ends the book by saying,

Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change.

Takeaway :

This is a pop-science book on Machine Learning. ML has reached a point where it is not just for geeks anymore. Everyone needs to know about it; everyone needs to at least have a conceptual model of the field, as it has become all-pervasive. Having said that, it would take time to plod through the book if you are a complete newbie to ML. This book is massively appealing to someone who has a cursory knowledge of a few ML techniques and wants a 10,000 ft. view of the entire field’s past, present and future. The future, as the title of the book suggests, would be the invention of a master algorithm that unifies methods across all the five tribes.





Takeaway :

This book is a beautiful book that describes the math behind queueing systems. One learns a ton of math tools from this book, that can be used to analyze any system that has a queueing structure within it. The author presents the material in a highly enthusiastic tone with superb clarity. Thoroughly enjoyed going through the book.


In the last few decades, enormous computational speed has become accessible to many. A modern-day desktop has enough memory and processing speed to let a data analyst compute probabilities and perform statistical inference by writing computer programs. In such a context, this book can serve as a starting point for anyone who wishes to explore the subject of computational probability. This book has 21 puzzles that can be solved via simulation.

Solving a puzzle has its own advantages. Give a dataset with one dependent variable and a set of predictors to a dozen people, asking them to fit a regression model; I bet that you will see at least a dozen models, each of which could be argued to be plausible. Puzzles are different. The constraints put around the problem force you to get that ONE RIGHT solution. In doing so, you develop much more sophisticated thinking skills.

In the introductory chapter of the book, the author provides a basic framework for computational probability by showing ways to simulate and compute probabilities. This chapter gives the reader all the ammunition required to solve the various puzzles of the book. The author provides detailed solutions, including the relevant MATLAB code, to all the 21 puzzles.
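To give a flavor of what "computing probabilities by simulation" looks like, here is a minimal Python sketch. The dice question (at least one six in four rolls) is my own illustrative example and is not one of the 21 puzzles in the book; the book's own code is in MATLAB.

```python
import random

def estimate_probability(event, trials=100_000, seed=42):
    """Estimate Pr(event) as the fraction of simulated trials in which it occurs."""
    rng = random.Random(seed)
    hits = sum(event(rng) for _ in range(trials))
    return hits / trials

def at_least_one_six(rng):
    """One trial: roll a fair die four times and check for at least one six."""
    return any(rng.randint(1, 6) == 6 for _ in range(4))

if __name__ == "__main__":
    simulated = estimate_probability(at_least_one_six)
    exact = 1 - (5 / 6) ** 4
    print(f"simulated={simulated:.4f}  exact={exact:.4f}")
```

Here the closed-form answer is easy, so it serves as a check on the simulation; the puzzles in the book are interesting precisely because the closed form is often painful to derive.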

Some of my favorite puzzles from the book that are enlightening as well as paradoxical are :

  • The Gamow-Stern Elevator
  • The Pipe Smoker’s Discovery
  • A Toilet Paper Dilemma
  • Parrondo’s Paradox
  • How Long Is the Wait to Get the Potato Salad ?
  • The Appeals Court Paradox

Here is the link to my document that fleshes out the details of all the 21 puzzles in the book:

What’s in the above document?

I have written R code that aims to computationally solve each of the puzzles in the book. For each puzzle, there are two subsections. The first subsection spells out my attempt at solving the puzzle. The second subsection contains my learning from reading through the solution given by the author. The author provides extremely detailed MATLAB code, such that even someone with absolutely no exposure to MATLAB can understand the logic. In many cases I found that the code snippets in the book looked like elaborate pseudo code. There are many good references mentioned for each of the puzzles so that interested readers can explore further aspects. In most of the cases, the reader will realize that closed form solutions are extremely tedious to derive and that simulation based procedures make it easy to obtain solutions to many intractable problems.



In a book that has about 350 pages, the first 250-odd pages are devoted to probability, ODEs and difference equations. The last part of the book covers queuing theory for specific systems, i.e., Poisson arrivals and exponential service times with one or more servers. The most painful thing about this book is that there are innumerable typos. A book that is riddled with typos on almost every other page cannot be an appealing text for an undergrad. My guess is that this book will never make it to an undergrad’s study table, unless the authors make a serious effort to publish an errata or come up with a better version of the book. Is there anything good about the book at all ? Well, maybe the chapter on difference equations is worth going over just once. On second thought, I think that the first 250 pages of the book can be rewritten concisely so that they can be pushed to the appendix. That leaves the last 100 pages of the book, which read more like a cheat sheet than a book from which one can really learn something. This book desperately needs a rewrite from the authors, else it is going to languish alongside the books that die silently every year.



Every year there are at least a dozen pop math/stat books that get published. Most of them try to illustrate a variety of mathematical/statistical principles using analogies/anecdotes/stories that are easy to understand. It is a safe assumption that the authors of these books spend a considerable amount of time thinking about the apt analogies to use, those that are not too taxing on the reader but at the same time put across the key idea. I tend to read at least one pop math/stat book a year to whet my “analogy appetite”. It is one thing to write an equation about some principle and a completely different thing to be able to explain a math concept to somebody. Books such as these help in building one’s “analogy” database so that one can start seeing far more things from a math perspective. The author of this book, Jordan Ellenberg, is a math professor at the University of Wisconsin-Madison and writes a math column for “Slate”. The book is about 450-odd pages and gives a ton of analogies. In this post, I will try to list down the analogies and some points made in the context of several mathematical principles illustrated in the book.

  • Survivorship bias
    • Abraham Wald’s logic of placing armor on engines that had no bullet holes
    • Mutual funds performance over a long period
    • Baltimore stockbroker parable
  • Linearity Vs. Nonlinear behavior
    • Laffer curve
  • Notion of limits in Calculus
    • Zeno’s Paradox
    • Augustin-Louis Cauchy and his work on summing infinite series
  • Regression
    • Will all Americans become obese? The dangers of extrapolation
    • Galton Vs. Secrist – “Regression towards mediocrity” was observed in the data by both, but they had different explanations. Secrist remained in the dark and attributed mediocrity to whatever he felt like. Secrist thought the regression he painstakingly documented was a new law of business physics, something that would bring more certainty and rigor to the scientific study of commerce. But it was just the opposite. Galton, on the other hand, was a mathematician and hence rightly showed that in the presence of a random effect, regression towards the mean is a necessary fact. Wherever there is a random fluctuation, one observes regression towards the mean, be it mutual funds, performance of sportsmen, mood swings etc.
    • Correlation is non-transitive. Karl Pearson’s idea of using geometry makes it easy to prove.
    • Berkson’s fallacy – Why handsome men are jerks? Why popular novels are terrible?
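
Regression towards the mean is easy to see in a simulation. The sketch below assumes that observed performance is simply skill plus independent luck, both drawn from a standard normal distribution; that model and all the names are my own simplification, not something taken from the book.

```python
import random
from statistics import mean

def regression_to_mean(n=10_000, seed=1):
    """Average score of the period-1 top decile, in period 1 and again in period 2."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    first = [s + rng.gauss(0, 1) for s in skill]    # performance = skill + luck
    second = [s + rng.gauss(0, 1) for s in skill]   # same skill, fresh luck
    top = sorted(range(n), key=lambda i: first[i], reverse=True)[: n // 10]
    return mean(first[i] for i in top), mean(second[i] for i in top)

if __name__ == "__main__":
    first_avg, second_avg = regression_to_mean()
    print(f"top decile in period 1: {first_avg:.2f}, same group in period 2: {second_avg:.2f}")
```

The top performers of period 1 are still above average in period 2, because skill is real, but they fall back towards the mean because their good luck does not repeat. No "law of business physics" is needed.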


  • Law of Large numbers
    • Small school vs. Large school performance comparison
  • Partially ordered sets
    • Comparing disasters in human history
  • Hypothesis testing + “P value” + Type I error (seeing a pattern where there is none) + Type II error (missing a pattern when there is one)
    • Experimental data from dead fish fMRI measurement: Dead fish have the ability to correctly assess the emotions the people in the pictures displayed. Insane conclusion that passes statistical tests
    • Torah dataset (a 304,805-letter document) used by a group of researchers to find hidden meanings beneath the stories, genealogies and admonitions. Dangers of data mining.
    • Underpowered test : Using binoculars to detect moons around Mars
    • Overpowered test: If you study a large sample size, you are bound to reject null as your dataset will enable you to see ever-smaller effects. Just because you can detect them doesn’t mean they matter.
    • “Hot hand” in basketball : If you ask the right question, it is difficult to detect the effect statistically. The right question isn’t “Do basketball players sometimes temporarily get better or worse at making shots?”, the kind of yes/no question a significance test addresses. The test { Null : no “hot hand”, Alternate : “hot hand” } is underpowered. The right question is “How much does their ability vary with time, and to what extent can observers detect in real time whether a player is hot?” This is a tough question.
    • Skinner rejected the hypothesis that Shakespeare did not alliterate!
    • Null Hypothesis Significance Testing, NHST, is a fuzzy version of “Proof by contradiction”
    • Testing whether a set of stars in one corner of a constellation (Taurus) is grouped together by chance?
    • Parable by Cosma Shalizi : Examining the livers of sheep to predict future events. A very funny way to describe what’s going on with the published papers in many journals
    • John Ioannidis’s research paper “Why Most Published Research Findings Are False”
    • Tests of genetic association with disease – awash with false positives
    • Example of a low-powered study : A paper in Psychological Science (a premier journal) concluded that “Married women were more likely to support Mitt Romney when they were in the fertile portion of their ovulatory cycle”!
    • A low-powered study is only going to be able to see a pretty big effect. But sometimes you know that the effect, if it exists, is small. In other words, a study that accurately measures the effect of a gene is likely to be rejected as statistically insignificant, while any result that passes the p-value test is either a false positive or a true positive that massively overstates the effect.
    • Uri Simonsohn, a professor at Penn, brilliantly summarizes the problem of replicability as “p-hacking” (somehow getting it to the 0.05 level that enables one to publish papers)


    • In 2013, the Association for Psychological Science announced that it would start publishing a new genre of articles, called Registered Replication Reports. These reports, aimed at reproducing the effects reported in widely cited studies, are treated differently from usual papers in a crucial way: The proposed experiment is accepted for publication before the study is carried out. If the outcomes support the initial finding, great news, but if not they are published anyway so that the whole community can know the full state of the evidence.
  • Utility of Randomness in math
    • “Bounded gaps” conjecture: Is there a bound for the gap between two primes? Primes get rarer and rarer as we chug along the integer axis. Then what causes the gap to be bounded?
    • How many twin primes are there in the first N numbers (Among first N numbers, about N/log N are prime)?
    • Mysteries of prime numbers need new mathematical ideas that structure the concept of structurelessness itself
  • How to explain “Logarithm” to a kid? The logarithm of a positive integer can be thought as the number of digits in the positive number.
  • Forecast performance
    • Short term weather forecasts have become possible, given the explosion of computing power and big data. However, any forecast beyond 2 weeks is dicey. On the other hand, for some problems, such as predicting the course of an asteroid, more data and computing power yield highly accurate forecasts. Whatever domain you work in, you need to consider where it lies between these two examples: one where big data + computing power helps, and one where big data + computing power + whatever else is thrown in does not give any meaningful forecast beyond the short term.
  • Recommendation Algorithms
    • After decades of being fed with browsing data, recommendations for almost all the popular sites suck
    • Netflix prize, an example that is used by many modern Machine learning 101 courses: it took 3 years of community hacking to improve the recommendation algo. Sadly the algo was not put to use by Netflix. The world moved on in three years and Netflix was streaming movies online, which makes dud recommendations less of a big deal.
  • Bayes theorem
    • Which Facebook users are likely to be involved in terrorist activities? Facebook assigns a probability that each of its users is associated with terrorist activities. The following two questions have vastly different answers. You need to be careful about what you are asking.
      1. What is the chance that a person gets put on a Facebook’s list, given that they are not a terrorist?
      2. What’s the chance that a person’s not a terrorist, given that they are on Facebook list ?
    • Why must one go Bayes? P(Data | Null) is what a frequentist answers; P(Null | Data) is what a Bayesian answers
    • Are Roulette wheels biased? Use priors and experimental data to verify the same
  • Expected Value
    • Lottery ticket pricing
    • Cash WinFall: How a few groups hijacked the Massachusetts State Lottery. Link: Boston Globe story that explains why it turned out to be a private lottery.
    • Use the additivity law of expectation to solve Buffon’s Needle problem
  • Utility curve
    • If you miss your flight, how to quantify your annoyance level?
    • Utility of dollars earned for guy moonlighting is different from that of a tenured professor
    • St Petersburg paradox
  • Error correction coding, Hamming code, Hamming distance, Shannon’s work:
    • Reducing variance of loss in the Cash WinFall lottery: Choosing the random numbers with less variance is a computationally expensive problem if brute force is used. Information theory and projective geometry could be the basis on which the successful MIT group generated random numbers that had less variance while betting.
    • Bertillon’s card system to identify criminals, and Galton’s idea that redundancy in the card can be quantified, were formalized by Shannon, who showed that the correlation between variables reduces the informativeness of a card
  • Condorcet Paradox
    • Deciding a three way election is riddled with many issues. There is no such thing as the public response. Electoral process defines the public response and makes peace with the many paradoxes that are inherent in deciding the public response.
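
The two Facebook questions under the Bayes theorem bullet above are easy to confuse, so here is a minimal sketch of why their answers differ so much. Every number in it (population size, number of terrorists, hit rate, false alarm rate) is a made-up assumption for illustration and does not come from the book.

```python
from fractions import Fraction

def prob_innocent_given_listed(population, terrorists, hit_rate, false_alarm_rate):
    """P(not a terrorist | on the list), computed with Bayes' rule."""
    innocents = population - terrorists
    listed_innocents = innocents * false_alarm_rate   # expected innocent people on the list
    listed_terrorists = terrorists * hit_rate         # expected terrorists on the list
    return listed_innocents / (listed_innocents + listed_terrorists)

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
population = 200_000_000
terrorists = 10_000
hit_rate = Fraction(1, 1)              # assume every real terrorist gets listed
false_alarm_rate = Fraction(1, 2000)   # question 1: P(on list | not a terrorist)

answer_2 = prob_innocent_given_listed(population, terrorists, hit_rate, false_alarm_rate)

print("Q1: P(on list | not a terrorist) =", float(false_alarm_rate))
print("Q2: P(not a terrorist | on list) =", round(float(answer_2), 3))
```

With these assumed inputs the first probability is tiny while the second is large, simply because innocent users vastly outnumber terrorists.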

Quotes from the book:

  • Knowing mathematics is like wearing a pair of X-ray specs that reveal hidden structures underneath the messy and chaotic surface of the world
  • Mathematics is the extension of common sense. Without the rigorous structure that math provides, common sense can lead you astray. Formal mathematics without common sense would turn math computations into a sterile exercise.
  • It is pretty hard to understand mathematics without doing mathematics. There is no royal road to any field of math. Getting your hands dirty is a prerequisite
  • People who go into mathematics for fame and glory don’t stay in mathematics for long
  • Just because we can assign whatever meaning we like to a string of mathematical symbols doesn’t mean we should. In math, as in life, there are good choices and there are bad ones. In the mathematical context, the good choices are the ones that settle unnecessary perplexities without creating new ones
  • We have to teach math that values precise answers but also intelligent approximation, that demands the ability to deploy existing algorithms fluently but also the horse sense to work things out on the fly, that mixes rigidity with a sense of play. If we don’t teach it that way, we are not teaching mathematics at all.
  • Fields Medalist David Mumford: Dispense with plane geometry entirely in the syllabus and replace it with a first course in programming.
  • “Statistically noticeable” / “Statistically detectable” is a better term than using “Statistically significant”. This should be the first statement that must be drilled in to any newbie taking stats101 course.
  • If gambling is exciting, you are doing it wrong – A powerful maxim applicable for people looking for investment opportunities too. Hot stocks provide excitement and most of the times that is all they do.
  • It is tempting to think of “very improbable” as meaning “essentially impossible”. Sadly NHST makes us infer based on “very improbable observation”. One good reason why Bayes is priceless in this aspect
  • One of the most painful aspects of teaching mathematics is seeing my students damaged by the cult of the genius. That cult tells students that it’s not worth doing math unless you’re the best at math—because those special few are the only ones whose contributions really count. We don’t treat any other subject that way. I’ve never heard a student say, "I like ‘Hamlet,’ but I don’t really belong in AP English—that child who sits in the front row knows half the plays by heart, and he started reading Shakespeare when he was 7!" Basketball players don’t quit just because one of their teammates outshines them. But I see promising young mathematicians quit every year because someone in their range of vision is "ahead" of them. And losing mathematicians isn’t the only problem. We need more math majors who don’t become mathematicians—more math-major doctors, more math-major high-school teachers, more math-major CEOs, more math-major senators. But we won’t get there until we dump the stereotype that math is worthwhile only for child geniuses

The book ends with a quote from Samuel Beckett





We see/hear/talk about “Information” in many contexts. In the last two decades or so, one can also go and make a career in the field of “Information” technology. But what is “Information”? If someone talks about a certain subject for 10 minutes in English and 10 minutes in French, is the “Information” the same in both instances? Can we quantify the two instances in some way? This book explains Claude Shannon’s remarkable achievement of measuring “Information” in terms of probabilities. Almost 50 years ago, Shannon laid out a mathematical framework, and it was an open challenge for engineers to develop the devices and technologies that Shannon had proved possible as a “mathematical certainty”. This book distils the main ideas that go into quantifying information with very little math and hence makes it accessible to a wider audience. A must read if you are curious about knowing a bit about “Information”, which has become a part of everyday vocabulary.




Takeaway:

I think this book needs to be read after having some understanding of BUGS software and also having some R/S programming skills. That familiarity can help you simulate and check for yourself the various results and graphs the author uses to illustrate Bayesian concepts. The book starts by explaining the essence of any econometric model and the way in which an econometrician has to put in assumptions to obtain the posterior distribution of various parameters. The core of the book is covered in three chapters, the first two chapters covering model estimation and model checking, and the fourth chapter covering MCMC techniques. The rest of the chapters cover linear models, non-linear models and time series models. There are two chapters, one on panel data and one on instrumental variables, that are essential for a practicing econometrician tackling the problem of endogenous variables. BUGS code for all the models explained in the book is given in the appendix, and hence the book can serve as a quick reference for BUGS syntax. Overall a self-contained book and a perfect book to start on the Bayesian econometric analysis journey.



The author is a CS professor at SUNY, Stony Brook. This book recounts his experience of building a mathematical system to bet on the outcomes of what is considered the fastest ball game in the world, “Jai alai”. In the English vernacular this is sometimes spelled as it sounds, that is, “hi-li”. The book recounts the history of the game and how it made its way to the US from Spain and France. However, the focus of the book is on using mathematical modeling and computers to analyze the game and design a betting system. The game itself is designed in such a way that it is a textbook case for mathematical analysis. The players enter the competition through a FIFO queue and the first player to score 7 points is the winner. It takes hardly a few minutes to understand the game from this wiki.

With the help of some of his grad students, the author works on the following questions :

  • Given that a player starts in a specific position, what is the probability that he ends up in a Win/Place/Show ?
  • What is the combination of numbers that has the highest probability of winning a Trifecta ?
  • How does one build a statistical model to evaluate the relative skills of the players ?
  • Given that two players A and B have probabilities of winning pa and pb, how does one construct a model that evaluates the probability of A winning over B ?
  • How does one create a payoff model for the various bets that are allowed in the game ?
  • How do you deal with missing  / corrupt data ?
  • Given the 1) payoffs of various bets, 2) the probabilities of a player winning from a specific position, and 3) the relative skillsets, how does one combine all of these elements to create a betting strategy ?

I have just outlined a few of the questions from the entire book. There are numerous side discussions that make the book a very interesting read. Here is one of the many examples from the book that I found interesting :

Almost every person who learns to do simulation comes across the linear congruential generator (LCG), one of the basic number-theoretic techniques for generating pseudo random numbers. It has the following recursive form :

$$X_{k+1} = (a X_k + c) \bmod n$$

By choosing appropriate values for a, c and n, one can generate pseudo random numbers.

The book connects the above recursive form to a roulette wheel :

Why do casinos and their patrons trust that roulette wheels generate random numbers? Why can’t the fellow in charge of rolling the ball learn to throw it so it always lands in the double-zero slot? The reason is that the ball always travels a very long path around the edge of the wheel before falling, but the final slot depends upon the exact length of the entire path. Even a very slight difference in initial ball speed means the ball will land in a completely different slot.

So how can we exploit this idea to generate pseudorandom numbers? A big number (corresponding to the circumference of the wheel) times a big number (the number of trips made around the wheel before the ball comes to rest) yields a very big number (the total distance that the ball travels). Adding this distance to the starting point (the release point of the ball) determines exactly where the ball will end up. Taking the remainder of this total with respect to the wheel circumference determines the final position of the ball by subtracting all the loops made around the wheel by the ball.

The above analogy makes the appearance of mod operator in LCG equation obvious.
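
To make the roulette analogy concrete, here is a minimal sketch of an LCG that steps x to (a*x + c) mod n and then maps each state onto a wheel slot. The constants a, c and n below are commonly quoted textbook values and the 38-slot wheel is my own choice for illustration; neither is taken from the book.

```python
def lcg(seed, a=1103515245, c=12345, n=2**31):
    """Yield pseudo random integers using x <- (a*x + c) mod n."""
    x = seed
    while True:
        # Multiply (many trips around the wheel), add the release point,
        # then keep only the remainder (the final position on the wheel).
        x = (a * x + c) % n
        yield x

def spin(seed, slots=38, count=5):
    """Map LCG states onto wheel slots 0..slots-1.

    The slot is taken from the high-order part of the state (x * slots // n),
    because the low-order bits of a power-of-two LCG repeat with short periods.
    """
    n = 2**31
    gen = lcg(seed, n=n)
    return [next(gen) * slots // n for _ in range(count)]

print(spin(seed=2024))
```

The same seed always reproduces the same sequence, which is exactly what "pseudo" random means here.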

One does not need to know much about Jai-alai to appreciate the modeling aspects of the game and statistical techniques mentioned in the book. In fact this book is a classic story of how one goes about modeling a real life scenario and profiting from it.



With total silence around me and my mind wanting to immerse in a book, I picked up this book from my inventory. I came across a reference to this work in Aaron Brown’s book on Risk Management.

First something about the cover:

The young woman on the right is the classical Goddess Fortuna, whom today we might call Lady Luck. The young man on the left is Chance. Fortuna is holding an enormous bunch of fruits, symbolizing the good luck that she can bring. But notice that she has only one sandal. That means that she can also bring bad luck. And she is sitting on a soap bubble! This is to indicate that what you get from luck does not last. Chance is holding lottery tickets. Dosso Dossi was a court painter in the northern Italian city of Ferrara, which is near Venice. Venice had recently introduced a state lottery to raise money. It was not so different from modern state-run lotteries, except that Venice gave you better odds than any state-run lottery today. Art critics say that Dosso Dossi believed that life is a lottery for everyone. Do you agree that life is a lottery for everyone? The painting is in the J. Paul Getty Museum, Los Angeles, and the above note is adapted from notes for a Dossi exhibit, 1999.

The chapter starts with a set of 7 questions and it is suggested that readers solve them before proceeding with the book.


The first chapter deals with some basic terminology that logicians use. The following terms are defined and examples are given to explain each of them in detail:

  • Argument: A point or series of reasons presented to support a proposition which is the conclusion of the argument.
  • Premises + Conclusion: An argument can be divided in to premises and a conclusion.
  • Propositions: Premises and conclusion are propositions, statements that can be either true or false.
  • Validity of an argument: Validity has to do with the logical connection between premises and conclusion, and not with the truth of the premises or the conclusion. If the conclusion can be false while all the premises are true, we have an invalid argument.
  • Soundness of an argument: Soundness for deductive logic has to do with both validity and the truth of the premises.
  • Validity vs. Truth: Validity is not truth. Logic takes the premises as given and checks whether the conclusion follows from them. If the premises are false, the reasoning can still be valid, but the conclusion need not be true.

Logic is concerned only with the reasoning. Given the premises, it can tell you whether the conclusion is valid or not. It cannot say anything about the veracity of the premises. Hence there are two ways to criticize a deduction: 1) A premise is false, 2) The argument is invalid. So there is a division of labor. Who is an expert on the truth of premises? Detectives, nurses, surgeons, pollsters, historians, astrologers, zoologists, investigative reporters, you and me. Who is an expert on validity? A logician.

The takeaway of the chapter is that valid arguments are risk-free arguments, i.e. given true premises, you are guaranteed a true conclusion.

Inductive Logic

The chapter introduces risky arguments and inductive logic as a mechanism for reasoning. Valid arguments are risk-free arguments. A risky argument is one that is very good, yet its conclusion can be false, even when the premises are true. Inductive logic studies risky arguments. There are many forms of risky arguments, like making a statement about a population from a statement about a sample, making a statement about a sample from a statement about a population, making a statement about a sample based on a statement about another sample etc. Not all these statements can be studied via inductive logic. Also, there may be more to risky arguments than inductive logic. Inductive logic does study risky arguments, but maybe not every kind of risky argument. The terms introduced in this chapter are

  • Inference to the best explanation
  • Risky Argument
  • Inductive Logic
  • Testimony
  • Decision theory

The takeaway of the chapter is that Inductive logic analyzes risky arguments using probability ideas.

The Gambler’s fallacy

This chapter talks about the gambler’s fallacy: a gambler justifies betting on red at a roulette wheel because the last X outcomes on the wheel have been black. His premise is that the wheel is fair, but his action contradicts that premise, since it questions the independence of outcomes. Informal definitions are given for bias, randomness, complexity and no regularity. Serious thinking about risks, which uses probability models, can go wrong in two very different ways. 1) The model may not represent reality well. That is a mistake about the real world. 2) We can draw wrong conclusions from the model. That is a logical error. Criticizing the model is like challenging the premises. Criticizing the analysis of the model is like challenging the reasoning.
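
One way to convince yourself that a streak of blacks does not make red "due" is to simulate a fair wheel and compare the overall frequency of red with the frequency of red right after a run of blacks. This is only a sketch under my own assumptions: an American-style wheel (18 red, 18 black, 2 green), a streak length of 3 and a fixed seed, none of which come from the book.

```python
import random

def red_frequencies(streak=3, spins=200_000, seed=0):
    """Return (overall frequency of red, frequency of red after `streak` blacks)."""
    rng = random.Random(seed)
    colors = ["red"] * 18 + ["black"] * 18 + ["green"] * 2
    outcomes = [rng.choice(colors) for _ in range(spins)]

    overall = outcomes.count("red") / spins

    # Keep only the spins that immediately follow `streak` blacks in a row.
    after_streak = [
        outcomes[i]
        for i in range(streak, spins)
        if all(o == "black" for o in outcomes[i - streak:i])
    ]
    conditional = after_streak.count("red") / len(after_streak)
    return overall, conditional

overall, conditional = red_frequencies()
print("P(red) overall:         ", round(overall, 3))
print("P(red | 3 blacks before):", round(conditional, 3))
```

If the outcomes are independent, the two printed frequencies should be close to each other, up to simulation noise.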

Elementary Probability Ideas

This chapter introduces some basic ideas of events, ways to compute probability of compound events etc. The chapter also gives an idea of the different terminologies used by statisticians and logicians, though they mean the same thing. Logicians are interested in arguments that go from premises to conclusions. Premises and conclusions are propositions. So, inductive logic textbooks usually talk about the probability of propositions. Most statisticians and most textbooks on probability talk about the probability of events. So there are two languages of probability. Why learn two languages when one will do? Because some students will talk the event language, and others will talk the proposition language. Some students will go on to learn more statistics, and talk the event language. Other students will follow logic, and talk the proposition language. The important thing is to be able to understand anyone who has something useful to say.

Conditional Probability

This chapter gives formulae for computing conditional probabilities. All the conditioning is done for a discrete random variable. Anything more sophisticated than a discrete RV would have alienated non-math readers of the book. A few examples are given to solidify the notions of conditional probability.

The Basic Rules of Probability & Bayes Rule

Rules of probability such as normality, additivity, total probability and statistical independence are explained via visuals. I think this chapter and the previous three are geared towards a person who is a total novice in probability theory. The book also gives an intuition into Bayes rule using elementary examples that anyone can understand. Concepts such as reliability testing are also discussed.
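
For quick reference, the rules named above can be written down in their standard textbook form. The notation here (A, B for events or propositions, H for a hypothesis, E for evidence, Ω for the certain event) is my own and may differ from the book's.

```latex
\begin{align*}
\text{Normality:}\quad & 0 \le P(A) \le 1, \qquad P(\Omega) = 1 \\
\text{Additivity:}\quad & P(A \lor B) = P(A) + P(B) \quad \text{if } A \text{ and } B \text{ are mutually exclusive} \\
\text{Conditional probability:}\quad & P(A \mid B) = \frac{P(A \land B)}{P(B)}, \qquad P(B) > 0 \\
\text{Total probability:}\quad & P(A) = P(A \mid B)\,P(B) + P(A \mid \lnot B)\,P(\lnot B) \\
\text{Independence:}\quad & P(A \land B) = P(A)\,P(B) \\
\text{Bayes' rule:}\quad & P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \lnot H)\,P(\lnot H)}
\end{align*}
```

Bayes' rule is just the conditional probability definition with the denominator expanded by total probability.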

How to combine Probabilities and Utilities?

There are three chapters under this section. The chapter on expected value introduces a measure of the utility of a consequence and explores various lottery situations to show that the cards are stacked against every lottery buyer and the lottery owner always holds an edge. The chapter on maximizing expected value says that one of the ways to choose amongst a set of actions is to choose the one that gives the highest expected value. To compute the expected value one has to represent the degrees of belief by probabilities and the consequences of action via utiles (they can be converted into equivalent monetary units). Despite the obviousness of the expected value rule, there are a few paradoxes and those are explored in the chapter, the popular one being the Allais paradox. All these paradoxes have a common message: the expected value rule does not factor in attitudes such as risk aversion and other behavioral biases, and hence might just be a way to define utilities in the first place. So, the expected value rule is not as watertight as it might seem. Also there are situations where decision theory cannot be of help. One may disagree about the probability of the consequences; one may also disagree about the utilities (how dangerous or desirable the consequences are). Often there is a disagreement about both probability and utility. Decision theory cannot settle such disagreements. But at least it can analyze the disagreement, so that both parties can see what they are arguing about. The last chapter in this section deals with decision theory. The three decision rules explained in the chapter are 1) Dominance rule 2) Expected value rule 3) Dominant expected value rule. Pascal’s wager is introduced to explain the three decision rules.
The basic framework is to come up with a partition of possible states of affairs, possible acts that agents can undertake and utilities of the consequences of each possible act, in each possible state of affairs in the partition.
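
The framework of states, acts and utilities can be sketched in a few lines. The umbrella example, its utilities and its probabilities are entirely made up for illustration, and the dominance check below is a simplified version that ignores the book's caveats (for instance, that the states should not depend on the act chosen).

```python
from fractions import Fraction

def expected_value(utilities, probabilities):
    """Expected utility of an act: sum of utility * probability over the states."""
    return sum(u * p for u, p in zip(utilities, probabilities))

def dominates(act_a, act_b):
    """True if act_a is never worse than act_b and strictly better in some state."""
    pairs = list(zip(act_a, act_b))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)

# Hypothetical partition of states: (rain, no rain), with my degrees of belief.
probabilities = (Fraction(3, 10), Fraction(7, 10))

# Hypothetical utilities of each act in each state, in utiles.
acts = {
    "take umbrella": (5, 3),
    "leave umbrella": (-10, 6),
}

values = {name: expected_value(u, probabilities) for name, u in acts.items()}
best = max(values, key=values.get)

print(values)
print("Neither act dominates:",
      not dominates(acts["take umbrella"], acts["leave umbrella"])
      and not dominates(acts["leave umbrella"], acts["take umbrella"]))
print("Expected value rule picks:", best)
```

When no act dominates, as here, the dominance rule is silent and one has to fall back on the expected value rule.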

Kinds of Probability

What do you mean ?

This chapter brings out the real meaning of the word “probability” and is probably the most important chapter of the book.

  1. This coin is biased toward heads. The probability of getting heads is about 0.6.
  2. It is probable that the dinosaurs were made extinct by a giant asteroid hitting the Earth.
    a. The probability that the dinosaurs were made extinct by a giant asteroid hitting the Earth is very high, about 0.9.
  3. Taking all the evidence into consideration, the probability that the dinosaurs were made extinct by a giant asteroid hitting the Earth is about 90%.
  4. The dinosaurs were made extinct by a giant asteroid hitting the Earth.

Statements (1) and (4) [but not (3)] are similar in one respect. Statement (4), like (1), is either true or false, regardless of what we know about the dinosaurs. If (4) is true, it is because of how the world is, especially what happened at the end of the dinosaur era. If (3) is true, it is not true because of “how the world is,” but because of how well the evidence supports statement (4). If (3) is true, it is because of inductive logic, not because of how the world is. The evidence mentioned in (3) will go back to laws of physics (iridium), geology (the asteroid), geophysics, climatology, and biology. But these special sciences do not explain why (3) is true. Statement (3) states a relation between the evidence provided by these special sciences, and statement (4), about dinosaurs. We cannot do experiments to test (3). Notice that the tests of (1) may involve repeated tosses of the coin. But it makes no sense at all to talk about repeatedly testing (3). Statement (2.a) is different from (3), because it does not mention evidence. Unfortunately, there are at least two ways to understand (2.a). When people say that so and so is probable, they mean that relative to the available evidence, so and so is probable. This is the interpersonal/evidential way. The other way to understand (2.a) is based on a personal sense of belief.

Statement (4) was a proposition about dinosaur extinction; (2) and (3) are about how credible (believable) (4) is. They are about the degree to which someone believes, or should believe, (4). They are about how confident one can or should be, in the light of that evidence. The use of the word probability in statements (2) and (3) is related to ideas such as belief, credibility, confidence and evidence, and the general name used to describe them is “belief-type probability”.

In contrast, the truth of statement (1) seems to have nothing to do with what we believe. We seem to be making a completely factual statement about a material object, namely the coin (and the device for tossing it). We could be simply wrong, whether we know it or not. This might be a fair coin, and we may simply have been misled by the small number of times we tossed it. We are talking about a physical property of the coin, which can be investigated by experiment. The use of probability in (1) is related to ideas such as frequency, propensity, disposition etc. and the general name used to describe these is “frequency-type probability”.

Belief-type probabilities have been called “epistemic”— from episteme, a Greek word for knowledge. Frequency-type probabilities have been called “aleatory,” from alea, a Latin word for games of chance, which provide clear examples of frequency-type probabilities. These words have never caught on. And it is much easier for most of us to remember plain English words rather than fancy Greek and Latin ones.

Frequency-type probability statements state how the world is. They state, for example, a physical property about a coin and tossing device, or the production practices of Acme and Bolt. Belief-type probability statements express a person’s confidence in a belief, or state the credibility of a conjecture or proposition in the light of evidence.

The takeaway from the chapter is that any statement with the word “probability” carries one of two types of meaning, belief-type or frequency-type. It is important to understand the exact type of probability that is being talked about in any statement.

Theories about Probability

The chapter describes four theories of probability,

  1. Belief type – Personal Probability
  2. Belief type – Logical Probability – Interpersonal /Evidential probability
  3. Frequency type – Limiting frequency based
  4. Frequency type – Propensity based

Probability as Measure of Belief

Personal Probabilities

This chapter explains the way in which degrees of belief can be represented as betting rates or odds. Let’s say my friend and I enter into a bet about an event A, say, “India wins the next cricket world cup”. If I think that India is 3 times more likely to win than to lose, then to translate this belief into a bet, I would invite my friend to take part in a bet where the total stake is Rs 4000. My friend agrees to bet Rs 1000 AGAINST the event and I take the other side of the bet by offering Rs 3000. Why is this bet in line with my beliefs? My expected payoff is (1000*3/4)+(-3000*1/4) = 0. My friend’s expected payoff is (-1000*3/4)+(3000*1/4) = 0. Hence from my point of view it is a fair bet. The same bet can be described as a bet ON the event: I bet Rs 3000 on the event and my friend is on the other side of the bet with Rs 1000. This is again a fair bet under my belief system, as my expected value is (1000*3/4)+(-3000*1/4) = 0 and my friend’s expected value is (-1000*3/4)+(3000*1/4) = 0. By agreeing to place a bet on or against the event, my friend and I are quantifying MY degree of belief as a betting fraction, i.e. my bet/total stake and my friend’s bet/total stake.

It is important to note that this might not be a fair bet according to my FRIEND’s belief system. He might be thinking that the event “India wins the next cricket world cup” has a 50/50 chance. In that case, if my friend’s belief pans out, he will have an edge betting against the event and he will be at a disadvantage betting for the event. Why? In the former case, his expected payoff would be (-1000*1/2)+(3000*1/2) > 0 and in the latter case, it would be (1000*1/2)+(-3000*1/2) < 0. As you can see, a bet in place means that the bet at least matches the belief system of one of the two players. Generalizing this to a market where investors buy and sell securities and there is a market maker, you get the picture that placing bets on securities is an act of quantifying the implicit belief system of the investors. A book maker / market maker never quotes fair bets; he always adds a component that keeps him safe, i.e., he doesn’t go bankrupt. The first ever example I came across in the context of pricing financial derivatives was in the book by Baxter and Rennie. Their introductory comments that describe arbitrage pricing and expectation pricing set the tone for a beautiful adventure of reading the book.
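
The cricket bet can be checked with a small calculation. This sketch just encodes the Rs 3000 versus Rs 1000 bet and evaluates it under my belief (3/4) and under my friend's belief (1/2); the function and variable names are my own.

```python
from fractions import Fraction

def expected_payoff(p_event, payoff_if_event, payoff_if_not):
    """Expected payoff of a bet for someone who believes P(event) = p_event."""
    return p_event * payoff_if_event + (1 - p_event) * payoff_if_not

my_belief = Fraction(3, 4)       # India is 3 times more likely to win than to lose
friend_belief = Fraction(1, 2)   # friend thinks it is 50/50

# I stake Rs 3000 ON the event, my friend stakes Rs 1000 AGAINST it.
my_side = (1000, -3000)          # (payoff if India wins, payoff if India loses)
friend_side = (-1000, 3000)

betting_rate = Fraction(3000, 3000 + 1000)   # my bet / total stake

print("My betting rate:", betting_rate)
print("My side, under my belief:          ", expected_payoff(my_belief, *my_side))
print("Friend's side, under my belief:    ", expected_payoff(my_belief, *friend_side))
print("Friend's side, under friend belief:", expected_payoff(friend_belief, *friend_side))
print("My side, under friend belief:      ", expected_payoff(friend_belief, *my_side))
```

The betting rate equals my degree of belief, which is why both sides of the bet have zero expected payoff from my point of view but not from my friend's.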

The takeaway of this chapter is: 1) belief cannot be measured exactly, 2) you can think of artificial randomizers to calibrate degree of belief.


This chapter explains that betting rates ought to satisfy basic rules of probability. There are three steps to proving this argument,

  1. Personal degrees of belief can be represented by betting rates.
  2. Personal betting rates should be coherent.
  3. A set of betting rates is coherent if and only if it satisfies the basic rules of probability.

Via examples, the chapter shows that any inconsistency in the odds quoted for and against an event by a person exposes that person to arbitrage, i.e. a gamble with a sure loss. Hence the betting fractions or the odds should satisfy the basic rules of probability.
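
Here is a minimal sketch of the sure-loss idea. I assume a person who quotes a betting rate for A and a betting rate for not-A, and who is willing to buy both bets at those rates: paying rate times stake up front and collecting the full stake if the bet wins. The rates of 3/5 and 3/5 are made up, and the setup is a simplification of the book's examples.

```python
from fractions import Fraction

def net_payoffs(rate_a, rate_not_a, stake=100):
    """Net result of buying a bet on A and a bet on not-A at the quoted rates.

    Returns (net if A happens, net if A does not happen).
    """
    price_a = rate_a * stake
    price_not_a = rate_not_a * stake
    if_a = (stake - price_a) - price_not_a          # bet on A wins, bet on not-A loses
    if_not_a = -price_a + (stake - price_not_a)     # bet on A loses, bet on not-A wins
    return if_a, if_not_a

incoherent = net_payoffs(Fraction(3, 5), Fraction(3, 5))   # rates add up to 6/5
coherent = net_payoffs(Fraction(3, 5), Fraction(2, 5))     # rates add up to 1

print("Incoherent rates:", incoherent)
print("Coherent rates:  ", coherent)
```

With rates that add up to more than 1, the buyer loses whichever way the event turns out; with rates that add up to less than 1, the same trick works against them when they sell both bets instead.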

The first systematic theory of personal probability was presented in 1926 by F. P. Ramsey, in a talk he gave to a philosophy club in Cambridge, England. He mentioned that if your betting rates don’t satisfy the basic rules of probability, then you are open to a sure-loss contract. But he had a much more profound— and difficult— argument that personal degrees of belief should satisfy the probability rules. In 1930, another young man, the Italian mathematician Bruno de Finetti, independently pioneered the theory of personal probability. He invented the word “coherence,” and did make considerable use of the sure-loss argument.

Learning from Experience

This chapter talks about the application of Bayes rule. It’s basically a way to combine personal probability and evidence to get a handle on an updated personal probability. The theory of personal probability was independently invented by Frank Ramsey and Bruno de Finetti. But the credit for the idea, and the very name “personal probability”, goes to the American statistician L. J. Savage (1917–1971). He clarified the idea of personal probability and combined it with Bayes’ Rule. The chapter also talks about contributions of various statisticians/scientists such as Richard Jeffrey, Harold Jeffreys, Rudolf Carnap, L. J. Savage and I. J. Good.
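
As a small illustration of learning from experience, this sketch updates a personal probability over two hypotheses about a coin, one toss at a time. The two hypotheses (fair, or biased with P(heads) = 3/5), the 50/50 prior and the toss sequence are my own assumptions for illustration.

```python
from fractions import Fraction

def update(prior, likelihoods):
    """One step of Bayes' rule: posterior is proportional to prior * likelihood."""
    joint = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# Hypothetical hypotheses about the coin and my prior degrees of belief.
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 5)}
belief = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

for toss in "HHTH":   # hypothetical evidence
    likelihoods = {h: (p if toss == "H" else 1 - p) for h, p in p_heads.items()}
    belief = update(belief, likelihoods)
    print(toss, {h: round(float(b), 3) for h, b in belief.items()})
```

Each posterior becomes the prior for the next toss, so the final belief is the same as if all four tosses had been processed in one go.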

Probability as Frequency

The four chapters under this section explore frequentist ideas. It starts off by describing some deductive connections between probability rules and our intuitions about stable frequencies. Subsequently, a core idea of frequency-type inductive inference, the significance idea, is presented. The last chapter in the section presents a second core idea of frequency-type inductive inference, the confidence idea. This idea explains the way opinion polls are now reported. It also explains how we can think of the use of statistics as inductive behavior. Basically all the chapters give a crash course on classical statistics without too much math.
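
The confidence idea says that the method is right most of the time, not that any single interval is. A rough way to see this is to simulate many opinion polls and count how often the usual 95% interval covers the true proportion. The true proportion, poll size, number of polls and the normal-approximation interval used here are all my own assumptions for this sketch.

```python
import math
import random

def coverage(true_p=0.4, n=500, polls=2000, seed=1):
    """Fraction of simulated polls whose 95% interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(polls):
        # One poll: ask n people, each says "yes" with probability true_p.
        yes = sum(rng.random() < true_p for _ in range(n))
        p_hat = yes / n
        # Normal-approximation margin of error at the 95% level.
        margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / polls

print("Share of intervals covering the truth:", coverage())
```

The printed share should land near 0.95, which is the sense in which the frequentist can claim a method that is right about 95% of the time.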

Probability applied to Philosophy

The book introduces David Hume’s idea that there is no justification for inductive inferences. Karl Popper, another philosopher, agreed with Hume but held the view that it doesn’t matter, as inductive inferences are invalid. According to Popper, “The only good reasoning is deductively valid reasoning. And that is all we need in order to get around in the world or do science”. There are two chapters that talk about evading Hume’s problem, one via the Bayesian evasion (which argues that Bayes’ Rule shows us the rational way to learn from experience) and the other via the behavior evasion (which argues that although there is no justification for any individual inductive inference, there is still a justification for inductive behavior).

The Bayesian’s response to Hume is :

Hume, you’re right. Given a set of premises, supposed to be all the reasons bearing on a conclusion, you can form any opinion you like. But you’re not addressing the issue that concerns us! At any point in our grown-up lives (let’s leave babies out of this), we have a lot of opinions and various degrees of belief about our opinions. The question is not whether these opinions are “rational.” The question is whether we are reasonable in modifying these opinions in the light of new experience, new evidence. That is where the theory of personal probability comes in. On pain of incoherence, we should always have a belief structure that satisfies the probability axioms. That means that there is a uniquely reasonable way to learn from experience— using Bayes’ Rule.

The Bayesian evades Hume’s problem by saying that Hume is right. But, continues the Bayesian, all we need is a model of reasonable change in belief. That is sufficient for us to be rational agents in a changing world.
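To make the idea of "reasonable change in belief" concrete, here is a minimal sketch of a single Bayes' Rule update. The two hypotheses about a coin, the prior of one half each, and the helper name `bayes_update` are illustrative assumptions, not taken from the book.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses after seeing one piece of evidence.

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> P(evidence | hypothesis)
    """
    # Numerator of Bayes' Rule for each hypothesis.
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    # Denominator: total probability of the evidence.
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical example: is the coin fair, or biased towards heads?
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
# Probability of observing heads under each hypothesis (assumed values).
heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

posterior = bayes_update(prior, heads)
print(posterior)
```

Feeding the posterior back in as the next prior repeats the update for each new observation, which is the sense in which the Bayesian "learns from experience".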

The frequentist response to Hume is:

We do our work in two steps: 1) Actively interfering in the course of nature, using a randomized experimental design. 2) Using a method of inference which is right most of the time, say 95% of the time. The frequentist says: “Hume, you are right, I do not have reasons for believing any one conclusion. But I have a reason for using my method of inference, namely that it is right most of the time.”

The chapter ends with a single-case objection and discusses the arguments used by Charles Sanders Peirce. In essence, the chapters in this section point to Peirce’s conclusion:
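The claim that a method is "right most of the time" can be checked by simulation. The sketch below repeats a hypothetical opinion poll many times and counts how often a standard normal-approximation 95% interval contains the true proportion. The true proportion, sample size, number of trials, and function names are all assumptions chosen for illustration.

```python
import math
import random


def interval_covers(true_p, n, rng, z=1.96):
    """Run one simulated poll and report whether its interval contains true_p."""
    # Each respondent says "yes" with probability true_p.
    hits = sum(rng.random() < true_p for _ in range(n))
    p_hat = hits / n
    # Normal-approximation margin of error for a proportion.
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin <= true_p <= p_hat + margin


def coverage(true_p=0.4, n=400, trials=1000, seed=0):
    """Fraction of repeated polls whose interval captured the true proportion."""
    rng = random.Random(seed)
    return sum(interval_covers(true_p, n, rng) for _ in range(trials)) / trials


print(coverage())
```

The printed fraction should land close to 0.95. No single interval is guaranteed to be right, but the procedure that produces them is right about that often, which is the frequentist's point.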

  • An argument form is deductively valid if the conclusion of an argument of such a form is always true when the premises are true.
  • An argument form is inductively good if the conclusion of an argument of such a form is usually true when the premises are true.
  • An argument form is inductively 95% good if the conclusion of an argument of such a form is true in 95% of the cases where the premises are true.


Takeaway:

The field of probability was not discovered; rather, it was created by the confusion of two concepts. The first is the frequency with which certain events recur, and the second is the degree of belief to attach to a proposition. If you want to understand these two schools of thought from a logician’s perspective and get a grasp on various philosophical takes on the word “probability”, then this book is a suitable text, as it gives a thorough exposition without too much math.



This book is about a set of letters exchanged between Pascal and Fermat in 1654 that led to a completely different way of looking at the future. The main content of the letters revolved around solving a particular problem, called the “problem of points”. A simpler version of the problem goes like this:

Suppose two players—call them Blaise and Pierre—place equal bets on who will win the best of five tosses of a fair coin. They start the game, but then have to stop before either player has won. How do they divide the pot? If each has won one toss when the game is abandoned after two throws, then clearly, they split the pot evenly, and if they abandon the game after four tosses when each has won twice, they do likewise. But what if they stop after three tosses, with one player ahead 2 to 1?

It is not known how many letters were exchanged between Pascal and Fermat to solve this problem, but the entire correspondence took place in 1654. By the end of it, Pascal and Fermat had managed to do what was unthinkable till then: “predict the future” and, more importantly, act on that prediction.

Pascal tried to solve the problem using recursion, whereas Fermat did it in a simpler way, i.e. by enumerating the future outcomes had the game continued. The solution gave rise to a new way of thinking, and it is said that this correspondence marked the birth of risk management as we know it today.
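Both approaches can be sketched in a few lines for the simplified coin-toss version above. This is a modern reconstruction under the assumption of a fair coin, not the notation used in the letters, and the function names are hypothetical. Each function takes the number of wins each player still needs and returns the fraction of the pot owed to the first player.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def share_by_enumeration(a_needs, b_needs):
    """Fermat-style: list every way the remaining tosses could have fallen."""
    # After this many further tosses the game is certainly decided.
    n = a_needs + b_needs - 1
    # Player A takes the pot in every sequence with enough A wins.
    wins_for_a = sum(
        1 for tosses in product("AB", repeat=n) if tosses.count("A") >= a_needs
    )
    return Fraction(wins_for_a, 2 ** n)


@lru_cache(maxsize=None)
def share_by_recursion(a_needs, b_needs):
    """Pascal-style: the value now is the average of the two next positions."""
    if a_needs == 0:
        return Fraction(1)  # A has already won
    if b_needs == 0:
        return Fraction(0)  # B has already won
    return (
        share_by_recursion(a_needs - 1, b_needs)
        + share_by_recursion(a_needs, b_needs - 1)
    ) / 2


# Best of five stopped at 2 to 1: the leader needs 1 more win, the other needs 2.
print(share_by_enumeration(1, 2), share_by_recursion(1, 2))
```

For the 2 to 1 case, both methods give the leader three quarters of the pot: of the four equally likely ways the next two tosses could fall, the trailing player wins only if both go their way.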

The book is not so much an analysis of the solution (the author believes that today anyone who has had just a few hours of instruction in probability theory can solve the problem of the points with ease) as an account of the developments leading up to 1654 and the developments after it. In the process, the book recounts all the important personalities who played a role in turning probability from a gut-based discipline into a rigorous mathematical one. The book can easily be read in an hour and could have been a blog post.

