January 2016


Ani Pema Chodron, the author of the book, gave a commencement address to the 2014 graduating class of Naropa, University of Boulder, Colorado. She did so, to keep her promise with her grand daughter, who was amongst the graduating class. The speech went viral on the net and this book is an offshoot of it. It contains the full text of the speech and a Q&A session. The title of the speech and hence the book, is inspired from a quote by Samuel Beckett. 

The author says that she had received one of the best pieces of advice, while learning how to teach.

I was being taught how to teach, as many of us were. And the instructions I received were to prepare well, know your subject, and then go in there with no note cards. Honestly, that is the best advice for life: no note cards. Just prepare well and know what you want to do. Give it your best, but you really don’t have a clue what’s going to happen. And note cards have limited usefulness.

The crux of the speech is author’s take on how to fail ? Most of the strategies that we adopt when faced with failure fall in to two categories. We either blame it on something external or become excessively critical about our self. How does one handle a failure ?

The author says,

We move away from the rawness, of holding the rawness of vulnerability in our heart, by blaming it on the other. Getting curious about outer circumstances and how they are impacting you, noticing what words come out and what your internal discussion is, this is the key.

So sometimes you can take rawness and vulnerability and turn it into creative poetry, writing, dance, music, song. Artists have done this from the beginning of time. Turn it into something that communicates to other people, and out of this raw and vulnerable space, communication really happens.

It’s in that space-when we aren’t masking ourselves or trying to make circumstances go away-that our best qualities begin to shine.


This book gives a a macro picture of  machine learning. In this post, I will briefly summarize the main points of the book. One can think of this post as a meta summary as the book itself is a summary of all the main areas of machine learning.



Machine learning is all around us, embedded in technologies and devices that we use in our daily lives. They are so integrated with our lives that we often do not even pause to appreciate its power. Well, whether we appreciate or not, there are companies harnessing the power of ML and profiting from it. So, the question arises, whether we need to care about ML at all ? When a technology becomes so pervasive, you need to understand a bit. You can’t control what you don’t understand. Hence at least from that perspective, having a general overview of the technologies involved, matters.

Machine learning is something new under the sun: a technology that builds itself. The artifact in ML is referred to as a learning algorithm. Humans have always been designing artifacts, whether they are hand built or mass produced. But learning algorithms are artifacts that design other artifacts. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer’s needs. But instead of turning stone in to masonry or gold in to jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.

At its core, Machine learning is about prediction: predicting what we want, the results of our actions, how to achieve our goals, how the world will change.

The author says that he has two goals in writing this book:

  • Provide a conceptual model of the field,i.e. rough knowledge so that you can use it effectively. There are many learning algorithms out there and many are being invented every year. The book provides an overview of learning algos by categorizing the people who use them. The author calls each category a tribe. Each tribe has its own master algorithm for prediction
    • Symbolists: They view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic.
      • Master algorithm for this tribe is inverse deduction
    • Connectionists: They reverse engineer the brain and are inspired by neuroscience and physics.
      • Master algorithm for this tribe is backpropagation
    • Evolutionaries: They simulate evolution on the computer and draw on genetics and evolutionary biology.
      • Master algorithm for this tribe is genetic programming
    • Bayesians: They believe that learning is a form of probabilistic inference and have their roots in statistics.
      • Master algorithm for this tribe is Bayesian inference
    • Analogizers: They learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
      • Master algorithm for this tribe is Support vector machines.

In practice each of the master algorithms that a particular tribe uses is good for some type of problems but not for others. What we really want is a single algorithm combining the key features of all of them, The Master Algorithm

  • Enable the reader to invent the master algorithm. A layman, approaching the forest from a distance, is in some ways better placed than the specialist, already deeply immersed in the study of particular trees. The author suggests the reader to pay attention to each tribe and get a general overview of what each tribe does and what tools that each tribe uses. By viewing each tribe as a piece of a puzzle, it would be easy to get a conceptual clarity of the entire field as a whole.

The Machine Learning Revolution

An algorithm is a sequence of instructions telling a computer what to do. The simplest algorithm is : flip a switch. The second simplest algorithm is : combine two bits. The idea connecting transistors and reasoning was understood by Shannon and his masters thesis lead to the most important scientific discipline – information theory. Computers are all about logic. Flipping a set of transistors is all what any algorithm does. However behind this benign activity, some of the most powerful algorithms go about doing their work by using some preexisting algos as building blocks. So, if computing power increases, do all the algos automatically become efficient and all powerful? Not necessarily. The serpent in the garden goes by the name "complexity monster". There are many heads to this complexity monster.

  • Space complexity : the number of bits of info that an algo needs to store in the computer’s memory
  • Time complexity : how long the algo takes to run
  • Human complexity : when algos become too complex, humans cannot control them and any errors in the algo execution causes panic.

Every algorithm has an input and an output – the data goes into the computer, the algo does the job and gives out the result. Machine learning turns this around : in goes the data and the desired result and out comes the algorithm that turns one in to the other. Learning algorithms – learners- are algorithms that make other algorithms. With machine learning, computers write their own programs. The author uses a nice low tech analogy to explain the power of machine learning :

Humans have known a way to get some of the things they need by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. The promise of machine learning is that the technology can mirror farming. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

This analogy makes two things immediate. First the more data we have, the more we can learn. Second, ML is a sword that can slay the complexity monster. If one looks at any learning algorithm, it does broadly two things. It either learns knowledge or learns some skills, i.e. it learns knowledge in terms of statistical models or learns procedures that underlie a skill. The author talks about another analogy in the information eco-system. He equates databases, crawlers, indexers and so on as herbivores, patiently munging on endless fields of data. Statistical algos, online analytical processing are the predators and learning algorithms are the super predators. Predators turn data in to information and Super predators turn information in to knowledge.

Should one go and spend years in computer science discipline to become ML expert ? Not necessarily. CS makes you think deterministically and ML needs probabilistic thinking. The difference in thinking is a large part of why Microsoft has had a lot more trouble catching up with Google than it did with Netscape. A browser is just a standard piece of software, but a search engine requires a different mind-set.

If you look at all the fantastic stuff behind ecommerce applications, it is all about match making; producers of information are being connected to consumers of information. In this context ,learning algorithms are the match makers. The arsenal of learning algos will serve as a key differentiator between the companies. Data is indeed the new oil.

The chapter ends with a discussion of how political campaigning is being revolutionized by ML. With truck loads of data at each voter level, politicians are turning to ML experts to help them with campaign analysis and directed targeting. The author predicts that in the future, machine learning will cause more elections to be close. What he means by that is that learning algorithms will be ultimate retail politicians.


The Master Algorithm

If you look at any of the algorithms such as nearest neighbor, decision tree, naive Bayes etc., they are all domain agnostic. This warrants a question, "Can there be one learner that does everything, across all domains?". If you have used let’s say a frequentist way of estimating a parameter and subsequently check it with a Bayesian inference, then both parameters give almost the same distribution if there is a LOT of data. Under a deluge of data, you will most certainly get a convergence of parameters in a model, irrespective of the model used. However the reality is that there is a scarcity of data, that necessitates making assumptions. Depending on the type of problem and depending on the type of assumption, certain class of learning models are better than the rest. If one speculates the presence of a master algorithm, then the assumptions also need to go in as input. This chapter explores the presence of a master algorithm that can be learnt from input data, output data and the assumptions. The author grandly states the central hypothesis of the book as

All knowledge-past, present, and future-can be derived from data by a single, universal learning algorithm.

The author speculates the presence of a master algorithm and says that there are arguments from many fields that speculate the presence of one.

The argument from Neuroscience : The evidence from the brain suggests that it uses the same learning algorithm, throughout, with the areas dedicated to the different senses distinguished only by the different inputs they are connected to. In turn, the associative areas acquire their function by being connected to multiple sensory regions, and the "executive" areas acquire theirs by connecting the associative areas and motor output.

If you look at any cortex in a human brain, you find that the wiring pattern is similar. The cortex is organized in to six different layers, feedback loops, short range inhibitory connections and long-term excitatory connections. This pattern is very much common across the brain. There is some variation in the patterns but the structure pretty much is the same. The analogy is that the it is the same algo with different parameters and settings. Low-level motor skills are controlled by cerebellum that has a clearly different and regular architecture. However experiments have shown that cerebellum can be perfectly replaced by cortex. So, this again goes to suggest that there is some "master algorithm" controlling the entire brain.

If you look at a set of learning algos in the ML field, you can infer that at some level they are trying to reverse engineer the brain’s function. One of the five ML tribes, Connectionists, believe in this way of modeling the world.

The argument from Evolution : If one looks at the evolution of species since the beginning of earth, one can think of natural selection or whatever the nature does as an algorithm. This algorithm has all the species as inputs and the species at any given point in time as the output. The master algo does the work of eliminating certain species, allowing certain species to mutate, etc. It is a dynamic process where the outputs are again fed as inputs. This line of thought makes one speculate the presence of a master algorithm. In fact one of the five tribes in the ML world, Evolutionaries, strongly believe in this way of modeling.

The argument from Physics : Most of the physics is driven by simple equations that prune away all the noise in the data and focus on the underlying beauty. Physics laws discovered in one domain are seamlessly applied to other domains. If everything we see in the nature could be explained by few simple laws, then it makes sense that a single algorithm can induce all the can be induced. All the Master Algorithm has to do is provide a shortcut to the laws’ consequences, replacing impossibly long mathematical derivations with much shorter ones based on actual observations. Another way to look at scientific disciplines is to think of various laws, states as outcomes of an dynamic optimization problem. However physics is unique in its simplicity. It’s only reasonably effective.

Machine learning is what you get when the unreasonable effectiveness of mathematics meets the unreasonable effectiveness of data.

The argument from Statistics : Bayesians look at the world from a probabilistic and learning mindset. Bayes rule is a recipe to turn data in to knowledge. In the yesteryears, Bayes applications were confined to simple applications. However with the rise of computing power, Bayes applications are now being used to a wide range of complex modeling situations. Is Bayes the master algorithm ? Well, there seems to many critics to the Bayesian approach of modeling. All said and done, it definitely appears that Bayesian inference will be a part of "Master Algorithm" in some way.

The argument from computer science : The author mentions the famous unsolved problem in computer science, P vs. NP. NP-complete are a set of problems that are equivalent in their computational hardness. If you solve one of the problems, you solve the rest. For decades, many mathematicians, researchers and scientists have been finding clever tricks to almost solve NP-complete problems. But the fundamental problem still eludes us -  Is the class of problems for which we can efficiently compute same as the class of problems for which we can efficiently check whether a solution exists ? If you read up on this problem, you will realize that all the NP-complete problems can be reduced to a satisfiability problem. If we invent a learner that can learn to solve satisfiability, it has a good claim to being the Master Algorithm.

NP-completeness aside, the sheer fact that a computer can do a gazillion tasks should make one confident about speculating the presence of a master algorithm that does the job across several problems. The author uses the example of Turing machine and says that back then, it was unthinkable to actually see a Turing machine in action. Turing machine can solve every conceivable problem that can be solved by logical deduction. The fact that we see these machines everywhere means that, despite the odds, we might see a Master Algorithm sometime in the future.

The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.

Interesting analogy : The author is of the opinion that "human intuition" can’t replace data. There have been many instances where human intuition has gone terribly wrong and a guy with lots of data has done better. The author uses Brahe, Kepler and Newton’s work to draw a parallel to the machine learning.

Science goes through three phases, which we can call the Brahe, Kepler, and Newton phases. In the Brahe phase, we gather lots of data, like Tycho Brahe patiently recording the positions of the planets night after night, year after year. In the Kepler phase, we fit empirical laws to the data, like Kepler did to the planets’ motions. In the Newton phase, we discover the deeper truths. Most science consists of Brahe-and-Kepler-like work; Newton moments are rare. Today, big data does the work of billions of Brahes, and machine learning the work of millions of Keplers. If-let’s hope so-there are more Newton moments to be had, they are as likely to come from tomorrow’s learning algorithms as from tomorrow’s even more overwhelmed scientists, or at least from a combination of the two.

Critics of Master Algo : Well, for a concept as ambitious as Master Algo, there are bound to be critics and there are many. The author mentions a few of them as examples,

  • Knowledge engineers
  • Marvin Minsky
  • Naom Chomsky
  • Jerry Fodor

Hedgehog or Fox : One of the other questions that comes up when we think of "Master Algorithm" is whether it is a fox or a hedgehog. There are many studies that have shown that being a fox is far better than being a hedgehog. The hedgehog is synonymous with that of an "expert". In the context of this book though, a learning algorithm can be considered as "hedgehog" if variations of it can solve all the learning problems. The author hopes that the "Master Algorithm" turns out to a hedgehog.

Five tribes of ML : In the quest for master algorithm, we do not have to start from scratch. There are already many decades of ML research underway and each type of research community is akin to a tribe. The author describes five tribes, i.e. symbolists, connectivists, Bayesians, evolutionaries and analogizers. Each of the tribes uses its own master algo. Here is an illustration from the author’s presentation



But the real master algo that the author is hinting at is an algo that combines all the features of the tribes.

The most important feature of each of the tribe is that they firmly believe that theirs is the only way to model and predict. Unfortunately, this thinking hinders their ability to model a broad set of problems. For example, a Bayesian would find it extremely difficult to leave the probabilistic inference method and look at the problem from a evolutionary point of view. His thinking is forged based on priors, posteriors and likelihood functions. If a Bayesian were to look at an evolutionary algorithm like a genetic algo, he might not critically analyze it and adapt it to the problem at hand. This limitation is prevalent across all tribes. Analogizers love support vector machines but there are limited because they look for similarities of inputs across various dimensions; i.e. they are bound to be hit by curse of dimensionality. The same serpent,"the curse of dimensionality" that the author talks about in the previous chapters comes and bites each tribe, depending on the type of problem being solved.

The obvious question that arises in a reader’s mind is, can there be combination of tribes that come together to solve a specific set of problems ? Indeed the tribe categorization is not a hard categorization of the algorithms. It is just meant as a starting point so that you can place the gamut of algos in separate buckets.


Hume’s problem of Induction

The chapter starts with a discussion of "Rationalism vs. Empiricism". The rationalist likes to plan everything in advance before making the first move. The empiricist prefers to try things and see how they turn out. There are philosophers who strongly believe in one and not in the other. From a practical standpoint, there have been productive contributions to our world from both the camps. David Hume is considered to be one of the greatest empiricist of all time. In the context of Machine Learning, one of his questions has hung like a sword of Damocles over all the knowledge, which is,

How can we ever be justified in generalizing from what we’ve seen to what we haven’t?

The author uses a simple example where you have to decide to ask someone out for a date or not. The dataset used in the example illustrates Hume’s problem of induction, i.e. there is no reason to pick one generalization over another. So, a safe way out of the problem, at least to begin with, is to assume that future will be like the past. Is this enough ? Not really. In the ML context, the real problem is : How to generalize to cases that we haven’t seen before. One might think that by amassing huge datasets, you can solve this problem. However once you do the math, you realize that you will run out of data that covers all the cases needed to carry the inductive argument safely. Each new data point is most likely unique and you have no choice but to generalize. According to Hume, there is no way to do it

If this all sounds a bit abstract, suppose you’re a major e-mail provider, and you need to label each incoming e-mail as spam or not spam. You may have a database of a trillion past e-mails, each already labeled as spam or not, but that won’t save you, since the chances that every new e-mail will be an exact copy of a previous one are just about zero. You have no choice but to try to figure out at a more general level what distinguishes spam from non-spam. And, according to Hume, there’s no way to do that.

The "no free lunch" theorem : If you have been reading some general articles in the media on ML and big data, it is likely that you would have come across a view on the following lines:

With enough data, ML can churn out the best learning algo. You don’t have to have strong priors, the fact that you have large data is going to give you all the power to understand and model the world.

The author introduces David Wolpert’s "no free lunch" theorem that a limit on how good a learning algorithm can be. The theorem says that no learner can be better than random guessing. Are you surprised by this theorem ? Here is how one can reconcile to it,

Pick your favorite learner. For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount. All I have to do is flip the labels of all unseen instances. Since the labels of the observed ones agree, there’s no way your learner can distinguish between the world and the antiworld. On average over the two, it’s as good as random guessing. And therefore, on average over all possible worlds, pairing each world with its antiworld, your learner is equivalent to flipping coins.

How to escape the above the random guessing limit? Just care about the world we live in and don’t care about alternate worlds. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. What are the implications of "free lunch theorem" in our modeling world ?

There’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.

Unwritten rule of Machine learning : The author states that the principle laid out by Newton in his work, "Principia", that serves as the first unwritten rule of ML

Whatever is true of everything we’ve seen is true of everything in the universe.

Newton’s principle is only the first step, however. We still need to figure out what is true of everything we’ve seen-how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner’s job is to flesh it out. One of the ways to think about creating a form is via "conjunctive concepts", i.e a series of statements with AND as the bridge. The problem with "conjunctive concepts" is that they are practically useless. Real world is driven by "disjunctive concepts", i.e a concept defined by a set of rules. One of the pioneers in this approach of discovering rules was Ryszard Michalski, a Polish computer scientist. After immigrating to the United States in 1970, he went on to found the symbolist school of machine learning, along with Tom Mitchell and Jaime Carbonell.

Overfitting and Underfitting : The author uses the words "blindness" and "hallucination" to describe underfitting and overfitting models. By using ton of hypothesis, you can almost certainly overfit the data. On the other hand, being sparse in your hypothesis set, you can fail to see the true patterns in the data. This classic problem is obviated by doing out-of-sample testing. Is it good enough ? Well, that’s the best that is available without going in to the muddy philosophical debates or alternative pessimistic approaches like that of Leslie Valiant(author of Probably Approximately Correct).

Induction as inverse of deduction : Symbolists work via the induction route and formulate an elaborate set of rules. Since this route is computationally intensive for large dataset, the symbolists prefer something like decision trees. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. How do we then decide which concept the instance belongs to?

Decision trees are used in many different fields. In machine learning, they grew out of work in psychology. Earl Hunt and colleagues used them in the 1960s to model how humans acquire new concepts, and one of Hunt’s graduate students, J. Ross Quinlan, later tried using them for
chess. His original goal was to predict the outcome of king-rook versus king-knight endgames from the board positions. From those humble beginnings, decision trees have grown to be, according to surveys, the most widely used machine-learning algorithm. It’s not hard to see why: they’re easy to understand, fast to learn, and usually quite accurate without too much tweaking. Quinlan is the most prominent researcher in the symbolist school. An unflappable, down-to-earth Australian, he made decision trees the gold standard in classification by dint of relentlessly improving them year after year, and writing beautifully clear papers about them. Whatever you want to predict, there’s a good chance someone has used a decision tree for it.

The Symbolists : The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate.

Symbolist machine learning is an offshoot of the knowledge engineering school of AI. The use of computers to automatically learn the rules made the work of pioneers like Ryszard Michalski, Tom Mitchell, and Ross Quinlan extremely popular and since then the field has exploded

What are the shortcomings of inverse deduction?

  • The number of possible inductions is vast, and unless we stay close to our initial knowledge, it’s easy to get lost in space
  • Inverse deduction is easily confused by noise
  • Real concepts can seldom be concisely defined by a set of rules. They’re not black and white: there’s a large gray area between, say, spam and nonspam. They require weighing and accumulating weak evidence until a clear picture emerges. Diagnosing an illness involves giving more weight to some symptoms than others, and being OK with incomplete evidence. No one has ever succeeded in learning a set of rules that will recognize a cat by looking at the pixels in an image, and probably no one ever will.

An interesting example of a success from Symbolists is Eve, the computer that discovered malaria drug. There was a flurry of excitement a year ago, when an article, titled, Robot Scientist Discovers Potential Malaria Drug was published in Scientific American. This is the kind of learning that Symbolists are gung-ho about.


How does your brain learn ?

This chapter covers the second tribe of the five tribes mentioned in the book. This tribe is called "Connectionists". Connectionists are highly critical about the way Symbolists work as they think that describing something via a set of rules is just the tip of iceberg. There is lot more going under the surface that formal reasoning can’t see. Let’s say you come across the word "love", Symbolists would associate a rule with such a concept whereas Connectionists would associate various parts of the brain to such a concept. In a sense, there is no one to one correspondence between a concept and a symbol. Instead the correspondence is many to many. Each concept is represented by many neurons, and each neuron participates in representing many different concepts. Hebb’s rule is the corner stone of connectionists. In a non-math way, it says that "Neurons that fire together stay together". The other big difference between Symbolists and Connectionists is that the former tribe believes in sequential processing whereas the latter tribe believes in parallel processing.

To get some basic understanding of the key algos used by connectionists, it is better to have a bit of understanding of the way neuron is structured in our brain. Here is a visual that I picked up from the author’s presentation :


The branches of the neuron connect to others via synapses and basic learning takes place via synaptic connections. The first formal model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943. It looked a lot like the logic gates computers are made of. The problem with this model was that the model did not learn. It was Frank Rosenblatt who came up with the first model of learning by giving variable weights to the connections between neurons. The following is a good schematic diagram of the perceptron:


This model generated a lot of excitement and ML received a lot of funding for various research projects. However this excitement was short lived. Marvin Minsky and few others published many examples where perceptron failed to learn. One of the most simple and dangerous example that perceptron could not learn was XOR operator. Perceptron was mathematically unimpeachable, searing in its clarity, and disastrous in its effects. Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.

Fast forward to John Hopfield work on spin glasses, there was a reincarnation of perceptron

Hopfield noticed an interesting similarity between spin glasses and neural networks: an electron’s spin responds to the behavior of its neighbors much like a neuron does. In the electron’s case, it flips up if the weighted sum of the neighbors exceeds a threshold and flips (or
stays) down otherwise. Inspired by this, he defined a type of neural network that evolves over time in the same way that a spin glass does and postulated that the network’s minimum energy states are its memories. Each such state has a "basin of attraction" of initial states that converge to it, and in this way the network can do pattern recognition: for example, if one of the memories is the pattern of black-and-white pixels formed by the digit nine and the network sees a distorted nine, it will converge to the "ideal" one and thereby recognize it. Suddenly, a vast body of physical theory was applicable to machine learning, and a flood of statistical physicists poured into the field, helping it break out of the local minimum it had been stuck in.

The author goes on to describe "Sigmoid" function and its ubiquitous nature. If you think about the curve for sometime, you will find it everywhere. I think the first time I came across this function was in Charles Handy’s book, "The Age of Paradox". Sigmoid functions in that book are used to describe various types of phenomenon that show an exponential slow rate of increase in the beginning, then a sudden explosive rate of increase and subsequently with an exponential rate of decrease. Basically if you take the first derivative of the Sigmoid function, you get the classic bell curve. I think the book,"The Age of Paradox" had a chapter with some heavy management gyan that went something like – "you need to create another Sigmoid curve in your life before the older Sigmoid curve starts a downfall" or something to that effect. I don’t quite recollect the exact idea from Charles Handy’s book, but there is a blog post by Bret Simmons, titled The Road to Davy’s Bar that goes in to related details.

Well, in the context of ML, the application of Sigmoid curve is more practical. It can be used to replace the step function and suddenly things become more tractable. A single neuron can learn a straight line but a set of neurons, i.e multi-layer perceptron can learn more convoluted curves. Agreed there is a curse of dimensionality here, but if you think about it, the hyperspace explosion is a double edged sword. On the one hand, there objective function is far more wiggly but on the other hand, there is a less scope that you will stuck at a local minimum via gradient search methods. With this Sigmoid input and multi layer tweak, Perceptron came back with vengeance. There was a ton of excitement just like the time when perceptron was introduced. The algorithm by which the learning takes place is called "back propagation", a term that is analogous to how human brains work. This algo was invented by David Rumelhart in 1986. It is a variant of gradient descent method. There is no mathematical proof that Back propagation will find the global minimum/maximum, though. The backprop solves what the author calls "credit assignment" problem. In a multi-layered perceptron the error between the target value and the current value needs to be propagated across all layers backward. The basic idea of error propagation, i.e error assignment for each of the layers is done via backprop.

Whenever the learner’s "retina" sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two.

Sadly the excitement phase petered down as learning with dozens of hundreds of hidden layers was computationally difficult. In the recent years though, backpropagation has made a comeback thanks to huge computing power and big data. It now goes by the technique "Deep Learning". The key idea of deep learning is based on auto encoders that is explained very well by the author. However there are many things that need to be worked out for deep learning to be anywhere close to the Master algorithm. All said and done there are a few limitations to exclusively following a connectionist tribe. Firstly, the learning algo is difficult to comprehend. It comprises convoluted connections between various neurons. The other limitation is that the approach is not compositional, meaning it is divorced from the way a big part of human cognition works.


Evolution : Nature’s Learning Algorithm

The chapter starts with the story of John Holland, the first person to have earned a PhD in computer science in 1959. Holland is known for his immense contribution to Genetic algorithms. His key insight lay in coming up with a fitness function that would assign a score to every program considered. What’s the role of fitness function ? Starting with a population of not-very-fit individuals-possibly completely random ones-the genetic algorithm has to come up with variations that can then be selected according to fitness. How does nature do that? This is where the genetic part of the algorithm comes in. In the same way that DNA encodes an organism as a sequence of base pairs, we can encode a program as a string of bits. Variations are produced by crossovers and mutations. The next breakthrough in the field of genetic programming came from Holland’s student John Koza who came up with the idea of evolving full blown computer programs.

Genetic programming’s first success, in 1995, was in designing electronic circuits. Starting with a pile of electronic components such as transistors, resistors, and capacitors, Koza’s system reinvented a previously patented design for a low-pass filter, a circuit that can be used for things like enhancing the bass on a dance-music track. Since then he’s made a sport of reinventing patented devices, turning them out by the dozen. The next milestone came in 2005, when the US Patent and Trademark Office awarded a patent to a genetically designed factory optimization system. If the Turing test had been to fool a patent examiner instead of a conversationalist, then January 25, 2005, would have been a date for the history books. Koza’s confidence stands out even in a field not known for its shrinking violets. He sees genetic programming as an invention machine, a silicon Edison for the twenty-first century.

A great mystery in genetic programming that is yet to be solved conclusively is the role of crossover. None of Holland’s theoretical results show that crossover actually helps; mutation suffices to exponentially increase the frequency of the fittest schemas in the population over time. There were other problems with genetic programming that finally made ML community at large divorce itself from this tribe

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy. As in the nature versus nurture debate, neither side has the whole answer; the key is figuring out how to combine the two. The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. So, is this it ? Have we stumbled on to the right path for "Master Algorithm" ? Not quite. There are tons of problems with evolutionary algos. Symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do. If we want to learn to diagnose cancer, for example, it’s not enough to say "this is how nature learns; let’s do the same." There’s too much at stake. Errors cost lives. Symbolists dominated the first few decades of cognitive psychology. In the 1980s and 1990s, connectionists held sway, but now Bayesians are on the rise.


In the Church of the Reverend Bayes

Perci Diaconis in his paper titled, MCMC Revolution, says that MCMC technique that came from Bayesian tribe has revolutionized applied mathematics. Indeed, thanks to high performance computing ability, Bayes is now a standard tool in any number cruncher’s tool kit. This chapter talks about various types of Bayesian techniques. The basic idea behind Bayes is that it is a systematic and quantified way of updating degrees of belief, in the light of new data. You can pretty much cast any problem, irrespective of the size of the data available, in to a Bayesian inference problem. Bayes theorem usually goes by the name "inverse probability" because in real life we know Pr(effect|cause) and we are looking to compute Pr(cause|effect). Bayes’ theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy. The main point of conflict between Bayesians and Non-Bayesians is the reliance of subjective priors to "turn on the Bayesian crank". Using subjective estimates as probabilities is considered sin by Frequentists, for whom, everything should be learned from the data.

One of the most common variant of Bayesian models is the "Naive Bayes" model where each cause is independent of other causes in creating an effect. Even though this assumption sounds extremely crazy, there are a ton of areas where Naive Bayes beats sophisticated models. No one is sure who invented the Naïve Bayes algorithm. It was mentioned without attribution in a 1973 pattern recognition textbook, but it only took off in the 1990s, when researchers noticed that, surprisingly, it was often more accurate than much more sophisticated learners. Also if you reflect a bit, you will realize that Naive Bayes is closely related to Perceptron algorithm.

The author mentions Markov Models as the next step in the evolution of Bayes models. Markov models are applicable to a family of random variables where each variable is conditionally independent of its history except the current state. Markov chains turn up everywhere and are one of the most intensively studied topics in mathematics, but they’re still a very limited kind of probabilistic model. A more complicated model is Hidden Markov Model where we don’t get to see the actual states but we have to infer them from the observations. A continuous version of HMM goes under the name "Kalman Filter" that has been used in many applications across domains.

Naive Bayes, Markov Models, Hidden Markov Models are all good but they are all a far cry from Symbolists. The next breakthrough came from Judea Pearl who invented Bayesian Networks. This allowed one to specify complex dependencies among random variables. By defining the conditional independence of a variable given a set of neighboring nodes, Bayesian networks tame the combinatorial explosion and make inferences tractable. Basically Bayesian Network can be thought of as a "generative model", a recipe for probabilistically generating a state of the world. Despite the complex nature of a Bayesian net, the author mentions that there have been techniques developed to successfully infer various aspects of the network. In this context, the author mentions MCMC and gives an intuitive explanation of the technique. A misconception amongst many is that MCMC is a simulation technique. Far from it, the procedure does not simulate any real process; rather it is an efficient way to generate samples from a Bayesian network. Inference in Bayesian networks is not limited to computing probabilities. It also includes finding the most probable explanation for the evidence. The author uses the "poster child" example of inferring the probability of heads from coin tosses to illustrate the Bayesian technique and compare it with the Frequentist world of inference.

The next set of models that came to dominate the Bayesian tribe is Markov Networks. A Markov network is a set of features and corresponding weights, which together define a probability distribution. Like Bayesian networks, Markov networks can be represented by graphs, but they have undirected arcs instead of arrows. Markov networks are a staple in many areas, such as computer vision. There are many who feel that Markov networks are far better than Naive Bayes, HMMs etc., as they can capture the influence from surroundings.

Bayesians and symbolists agree that prior assumptions are inevitable, but they differ in the kinds of prior knowledge they allow. For Bayesians, knowledge goes in the prior distribution over the structure and parameters of the model. In principle, the parameter prior could be anything we please, but ironically, Bayesians tend to choose uninformative priors (like assigning the same probability to all hypotheses) because they’re easier to compute with. For structure, Bayesian networks provide an intuitive way to incorporate knowledge: draw an arrow from A to B if you think that A directly causes B. But symbolists are much more flexible: you can provide as prior knowledge to your learner anything you can encode in logic, and practically anything can be encoded in logic-provided it’s black and white.

Clearly, we need both logic and probability. Curing cancer is a good example. A Bayesian network can model a single aspect of how cells function, like gene regulation or protein folding, but only logic can put all the pieces together into a coherent picture. On the other hand, logic can’t deal with incomplete or noisy information, which is pervasive in experimental biology, but Bayesian networks can handle it with aplomb.Combining connectionism and evolutionism was fairly easy: just evolve the network structure and learn the parameters by backpropagation. But unifying logic and probability is a much harder problem.


You are what you resemble

The author introduces techniques of the "Analogizers" tribe. This tribe uses similarities among various data points to categorize them in to distinct classes. In some sense, we all learn by analogy. Every example that illustrates an abstract concept is like an analogy. We learn by relating the similarity between two concepts and then figure what else one can infer based on the fact that two concepts are similar.

The chapter begins with explaining the most popular algorithm of the tribe, "the nearest neighbor algorithm". This was invented way back in 1951 by Evelyn Fix and Joe Hodges. The inventors faced a massive difficulty in publishing their algorithm. However the fact that the algo remained unpublished did not faze many researchers who went about developing variants of the algorithm like "K nearest neighbor" method, "Weighted K-nearest neighbor" etc. It was in 1967 that Tom Cover and Peter Hart proved that, given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier. This was a momentous revelation. Up until then, all known classifiers assumed that the frontier had a very specific form, typically a straight line. This was a double-edged sword: on the one hand, it made proofs of correctness possible, as in the case of the perceptron, but it also meant that the classifier was strictly limited in what it could learn. Nearest-neighbor was the first algorithm in history that could take advantage of unlimited amounts of data to learn arbitrarily complex concepts. No human being could hope to trace the frontiers it forms in hyperspace from millions of examples, but because of Cover and Hart’s proof, we know that they’re probably not far off the mark.

Is nearest neighbor algo, the master algorithm ? It isn’t because of curse of dimensionality. As the dimension of covariates goes up, the NN algo efficiency goes down. In fact the curse of dimensionality is the second most important stumbling block in the Machine learning, over-fitting being the first one. There are certain techniques to handle the dimension explosion but most of them are hacks and there is no guarantee that they are going to work.

Subsequently, the author introduces Support Vector Machines(SVM) that has become the most popular technique used by Analogizers. I loved the way author describes this technique using plain simple English. He asks the reader to visualize a fat serpent that moves between two countries that are at war. The story of finding the serpent incorporates pretty much all the math that is needed to compute support vectors, i.e.

  • kernel for SVM
  • support vectors
  • weight of the support vectors
  • constrained optimization
  • maximizing the margin of the classifier

My guess is, one would understand the math far easier, after reading through this section on SVMs. SVMs have many advantages and the author highlights most of them. Books such as these also help us in verbalizing math stuff in simple words. For example, if you were to explain the difference between constrained optimization and unconstrained optimization to a taxi driver, how would you do it? Read this book to check whether your explanation is better than what the author provides.

Towards the end of the chapter, the author talks about case-based reasoning and says that in the years to come, analogical reasoning will become so powerful that it will sweep through all the fields where case-based reasoning is still employed.


Learning without a teacher

Unlike the previous chapters that focused on labeled data, this chapter is learning via unsupervised learning. Cognitive scientists describe theories of child learning using algos and machine learning researchers have developed techniques based on them. The author explains k-means algorithm, a popular clustering technique. It is actually a special case of Expectation Maximization(EM) algorithm that was invented by three Harvard statisticians. EM is used in a ton of places. To learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. Once you have a cluster at the macro level, nothing stops you from using the same algo for each cluster and come up with sub-clusters etc.

Subsequently, the author introduces another popular technique for unsupervised learning, PCA, that is used for dimensional reduction. PCA tries to come up linear combination of various dimensions in the hyperspace so that total variance of the data across all dimensions is maximized. A step up to this algo is called "Isomap", a nonlinear dimensionality reduction technique. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network and finds the reduced coordinates that best approximate these distances.

After introducing clustering and dimensional reduction techniques, the author talks about "Reinforcement learning", a technique that relies on immediate response of the environment for various actions of the learner. Research on reinforcement learning started in earnest in the early 1980s, with the work of Rich Sutton and Andy Barto at the University of Massachusetts. They felt that learning depends crucially on interacting with the environment, but supervised algorithms didn’t capture this, and they found inspiration instead in the psychology of animal learning. Sutton went on to become the leading proponent of reinforcement learning. Another key step happened in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children’s learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup that combines neural networks and reinforcement learning is "DeepMind", a company that was acquired by Google for half a billion dollars.

Another algorithm that has a potential to be a part of "Master Algorithm" is chunking. Chunking remains a preeminent example of a learning algorithm inspired by psychology. The author gives a basic outline of this concept. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction, but a simpler type of learning by interacting with the environment is: A/B testing. The chapter ends with the author explaining another potentially killer algo, "relational learning".


The Pieces of the Puzzle Fall into Place

Progress in science comes from unifying theories; two or more seemingly disparate observations are driven by the same logic or law. If one looks at ML, it appears that Master Algorithm is akin to a unifying theory in science. It will unify all the master algorithms of each tribes, all the techniques of each tribes and give one cohesive way to learn from data.

In fact there is already a technique called "meta learning" that some of the tribes use with in their techniques. For example, bagging, random forests and boosting are some of the famous meta learning techniques used by Symbolists. Bayesians have something called "model averaging" that learns from each model by considering it as an hypothesis and then computes a score based on the vote given by each of the models. Meta learning in its current avatar is remarkably successful, but it’s not a very deep way to combine models. It’s also expensive, requiring as it does many runs of learning, and the combined models can be quite opaque.

The author used the following schematic diagram for each of the tribes, while explaining the rationale of a possible “Master Algorithm”


He then takes the reader through a tour of each of the tribes philosophy and their master algorithms and comes up a unifier, called "Alchemy", that he calls it as the "Master Algorithm". In the process of creating this master algorithm, he introduces Markov Logic Networks and says that they serve for representing the problem. Alchemy uses posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. The author is wary about Alchemy’s immediate application and says that there is a ton of research that is yet to be done, so that it can become a true Master Algorithm, i.e., one that has the capability to solve hard problems.

This chapter is a little more involved as it tries to connect all the ideas from the previous eight chapters and introduces a way to combine the various pieces of puzzle to create a "Master Algorithm". The chapter will also be very interesting for an aspiring ML researcher  who is trying to pick his focus area.


This is the World on Machine Learning

The last chapter of the book discusses the world in which "Master Algorithm" is all pervasive. The author tries to speculate answers to the following questions :

  • Will humans be replaced by machines ?
  • What do you want the learners of the world to know about you ?
  • How good a model of you, a learner, can have ?
  • How will ecommerce shopping experience be ?
  • Will there be a rise of "humanities" disciple after the automation of most of non-human related tasks ?
  • What will the current online dating sites morph in to ?
  • How will the Amazons, Netflixes, Googles of the world change ?
  • What will be the privacy issues in a society where most of the transactions and activities involve, one algo talking to another algo?
  • Will future wars be fought by robots ?
  • Will robot-warfare be viable ?
  • Will AI and Master Algorithm take over the world ?

The author ends the book by saying,

Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is  product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes  only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change.

takeawayTakeaway :

This book is a pop-science book for Machine Learning. ML has reached a point where it is not just for geeks anymore. Every one needs to know about it, Every one needs to at least have a conceptual model about the field as it has become all pervasive. Having said that, it would take time to plod though the book if you are a complete newbie to ML. This book is massively appealing to someone who a cursory knowledge of a few ML techniques and wants to have a 10,000 ft. view of the entire fields’ past, present and future. The future, as the title of book states is, would be the invention of a master algorithm that unifies methods across all the five tribes.



I stumbled on to this book a few weeks ago and immediately picked it up after a quick browse through the sections of the book. I had promptly placed it in my books-to-read list. I love anything related to information theory mainly because of its inter-disciplinary applications. The principles of information theory are applicable in a wide range of fields. In fact it will hard to pinpoint a specific area where concepts from information theory have not been applied. In this post, I will summarize the main points of the book.


Prologue : The Eternal War

The chapter is titled so, because there is a conflict between entropy and information. The entropy is the incessant march towards disorder. One of the ways that I can relate to is my music practice. If I don’t practice my music for long, I find it difficult to retrain my fingers and get back my muscle memory. "That which you don’t use atrophies". Entropy is also something similar. In the absence of any mechanisms to create information, the disorder of the system increases. This obviously raises a question about the mechanisms that allow the information to battle randomness and grow. The book is mainly about describing the mechanisms by which the information grows, the physical order of our world increases – that makes our planet unique, rich and uneven, from atoms to economies. The author focuses on planet earth as this is a special place where information lives, grows and hides in an otherwise mostly barren universe.

In the prologue, the author says that the book would answer the following questions:

  • What is Information ?
  • Where does it come from ?
  • Why is information concentrated on our planet?
  • Why does it grow on our planet ?
  • What are the natural, social and economic mechanisms that allow it to grow ?
  • How do the various mechanisms contribute to social and economic unevenness of the global economy ?
  • How does the social accumulation of information improve our capacity to accumulate even more information?

Introduction : From Atoms to People to Economies

The chapter starts with the story of Ludwig Boltzmann, the famous scientist who committed suicide. Though the exact reason is not known, the author speculates that it could be the apparent conflict between his theory and the order prevalent in the world. His theory was that there is always a march towards disorder, which stumped him because there were so many fascinating things in the nature that were orderly, systematic, almost giving an impression that there was a creator up there who was designing our world. The biggest sin that Ludwig committed, given the context of scientific temper at his time, was that he had worked across spatial scales. His theory made connections between atoms and gases, both belonging to different spatial scales. At that point in time, any connection between various spatial scales was considered as a sin.

At the turn of twentieth century, Ludwig was vindicated. There was immense cross-fertilization of ideas amongst many fields. Yet not all of the cross-fertilization took place near known scientific boundaries. Amid these multidisciplinary tangos, there was one concept that was promiscuous enough to play the field. This was the idea of information. In the twentieth century, the study of information was inspired by war as there was a urgent need to encode and decode messages effectively. The field took off after the revolutionary paper by Claude Shannon and Warren Weaver. Information as a concept found its followers in almost every field for the simple reason that it could be applied to microscopic as well as macroscopic worlds. It was the first truly scale independent concept. Even though the idea of information grew in prominence, many began to forget one crucial aspect of information

We forget about the physicality of information that had troubled Boltzmann. The word information became a synonym for the ethereal, the unphysical, the digital, the weightless, the immaterial. But information is physical. It is as physical as Boltzmann’s atoms or the energy they carry in their motion. Information is not tangible; it is not a solid or a fluid. It does not have its own particle either, but it is as physical as movement and temperature, which also do not have particles of their own. Information is incorporeal, but it is always physically embodied. Information is not a thing; rather, it is the arrangement of physical things. It is physical order, like what distinguishes different shuffles of a deck of cards.

One of the highlights of the work of Shannon and Weaver is that they divorced the idea of information and message. Colloquially we can use both the terms interchangeably. However the need to divorce the two was needed so that further developments in the field could happen. Whatever gets transmitted between two devices, two people, is information. It is humans who automatically interpret the information as a meaning, given the various contextual factors. This clear demarcation was given because technically , one could now focus on sending any kind of messages whether the message meant anything or not. Shannon also came up with a formula for encoding an arbitrary message with maximum efficiency. This formula looked identical to the Boltzmann’s formula.

The beauty of information being scale independent means that one can use principles of information theory to describe everything from atoms to economies. In all the previous attempts, natural sciences described the atom to human connection, the social sciences described the connection between humans and economies. Using the concept of information, one can analyze across all scales. The content of book is laid out in such a way that it describes the history of the universe, centered not on the arrow of time but on the arrow of complexity.

It is the accumulation of information and of our ability to process information that defines the arrow of growth encompassing the physical, the biological, the social, and the economic, and which extends from the origin of the universe to our modern economy. It is the growth of information that unifies the emergence of life with the growth of economies, and the emergence of complexity with the origins of wealth.

The Secret to Time Travel

This book made me look at child birth from a completely different perspective. The author compares child birth as an example of time travel; the baby is transferred from an environment(mother’s womb) that has essentially remained same since the last 1000 years in to 21st century world that is largely alien for the species. There are a ton of machines, gadgets, technologies, objects that are realizations of human knowledge and human knowhow. All the objects that we seen around embody information and imagination. The author uses two central actors, amongst many, to describe the way information grows, i.e.

  1. Physical objects: physical embodiment of information
  2. People: fundamental embodiment of knowledge and knowhow

The fundamental perspective of the author is,

Economy is the system by which people accumulate knowledge and knowhow to create packets of physical order, or products, that augment our capacity to accumulate more knowledge and knowhow and, in turn, accumulate more information.

How are humans different from other species on the planet ?

The fact that objects embody information and imagination may seem obvious. Information is a fundamental aspect of nature, one that is older than life itself. It is also an aspect of nature that accelerated with life. Consider the replication of information-rich molecules, such as DNA and RNA. The replication of DNA and RNA is not the replication of matter but the replication of the information that is embodied in matter. Living organisms are highly organized structures that process and produce information. Yet, our focus here will not be on the information-generating capacity that is embodied in the intimacy of our cells but that which emerged with humans and society. Humans are special animals when it comes to information, because unlike other species, we have developed an enormous ability to encode large volumes of information outside our bodies.

Humans are able to create physical instantiations of the objects we imagine, while other species are stuck with nature’s inventory.

The Body of the Meaningless

This chapter clarifies the differences amongst various terms used in information theory. Terms such as entropy and information are used interchangeably. Indeed they can be used in some situations but not always. Shannon’s definition of information relates to the number of bits required to encode a message with maximum efficiency. In a sense, a highly regular correlation rich structure has less information and a randomized set of instructions in a message has more information. He termed this as "entropy"(von Neumann told Shannon that calling his measure entropy would guarantee Shannon’s victory in every argument, since nobody really knew what entropy was). If I consider my laptop, it contains many documents, pictures, videos etc. In Shannon’s language, if I randomly switch the bits in my computer, the information increases. But this doesn’t go with our intuitive definition of information. Ideally the more regular, the more ordered the data is, there is more information in to it. So, there is a need to expand the definition of entropy as defined by Shannon so that one can use those concepts to talk about information that we can relate to.

The author gives a nice analogy of a half-filled stadium to show the difference between entropy as defined in statistical physics and entropy as defined by Shannon. In statistical physics, entropy is dependent on "multiplicity of states". A highly disordered system tends to have higher multiplicity of states and hence has higher entropy. However it is not necessary that a higher entropy system is necessarily more disordered. In other words, disorder can be equated to higher entropy but not always. In the physical sciences, information has always been referred to something that has order. So, in physical states, information is the opposite of entropy. The ordered states, commonly referred to as information rich states are highly correlated structures. These information rich structures are also uncommon and peculiar structures in the nature.

The author uses the example of Rubik’s cube to illustrate the rarity of ordered states in the nature. Rubik’s cube has 4.3 × 10^9 possible states and the perfect state can be obtained in less than 20 moves. However getting to this ordered state requires a specific movement of the cube that one is called a genius if he can reach to an ordered state in less than 30 moves. This example can be extrapolated to the nature. The growth of entropy is like a Rubik’s cube in the hands of a child. In nature information is rare not only because information-rich states are uncommon but also because they are inaccessible given the way in which nature explores the possible states. The author provides a few nice examples that show the connection between multiplicity of states and the ability to process information,i.e. compute

The main idea of this chapter is to look at the word "information" as defined by Shannon, and then reconcile the concept with the colloquial meaning of the word information and the work of Boltzmann.

The Eternal Anomaly

If the natural tendency of a system is to move towards disorder, move towards higher entropy, how does one explain the information explosion on our planet ? If we look around the planet, it is amazing to see so many beautiful creations of the nature. Why didn’t our planet disintegrate in to chaos ? Why does information grow on our planet ? To explain this phenomenon, the author introduces the theory put forth by Ilya Prigogine. The main idea of the theory is

Information emerges naturally in the steady states of physical systems that are out-of-equilibrium.

The author unpacks the above statement using many examples such as marble in a bowl, box filled with gas, whirlpool in a sink etc. Prigogine realized that although Boltzmann’s theory was correct, it did not apply to what we observe on Earth because our planet is an out-of-equilibrium pocket inside a larger system-the universe-that is moving toward equilibrium. In fact, our planet has never been close to any form of equilibrium. Prigogine did the math and showed that out-of-equilibrium systems give rise to information-rich steady states. So, that explains "Where information comes from ?". In an out-of-equilibrium system, such as Earth, the emergence of information is expected. It is no longer an anomaly. The bad news, however, is that entropy is always lurking on the borders of information-rich anomalies, waiting to devour these anomalies as soon as it gets the chance. Yet information has found ways to fight back. As a result, we live on a planet where information is "sticky" enough to be recombined and created. This stickiness, which is essential for the emergence of life and economies, also hinges on additional fundamental physical properties.

The author explains three mechanisms that make the information sticky. The first mechanism flows from Prigogine’s math that states that out-of-equilibrium systems self-organize into steady states in which order emerges spontaneously, minimizing the destruction of information. The second mechanism comes from Schrodinger’s theory that says Solids are essential to explain the information-rich nature of the life. The third mechanism by which information grows is matter’s ability to process information, or the ability of the matter to compute. The author explains wonderfully all the three aspects that make information "sticky"

The main idea of this chapter is to view our planet as out-of-equilibrium system. The other idea communicated by the author is that of "entropy barrier". I love this concept as it is philosophically aligned with what I believe, "Life is a Martingale".

Time is irreversible in a statistical system because the chaotic nature of systems of many particles implies that an infinite amount of information would be needed to reverse the evolution of the system. This also means that statistical systems cannot go backward because there are an infinite number of paths that are compatible with any present. As statistical systems move forward, they quickly forget how to go back. This infiniteness is what Prigogine calls the entropy barrier, and it is what provides a perspective of time that is not spatialized like the theories of time advanced by Newton and Einstein. For Prigogine, the past is not just unreachable; it simply does not exist. There is no past, although there was a past. In our universe, there is no past, and no future, but only a present that is being calculated at every instant. This instantaneous nature of reality is deep because it helps us connect statistical physics with computation. The instantaneous universe of Prigogine implies that the past is unreachable because it is incomputable at the micro level. Prigogine’s entropy barrier forbids the present to evolve into the past, except in idealized systems

Crystallized Imagination

The author starts off by giving his perspective on life

Life is all about : moving around and processing information, helping information grow while interacting in a social context.

If you reflect on the above statement a bit, I guess you will at least concur with some part of it, if not the entire statement. Our society’s ability to accumulate information requires flows of energy, the physical storage of information in solid objects, and of course our collective ability to compute. The flow of energy that keeps our planet’s information growing is clearly that coming from the sun. Plants capture that energy and transform it into sugar, and over long periods of time they degrade into the mineral fuel we know as oil. But as a species, we have also developed an amazing capacity to make information last. We have learned to accumulate information in objects, starting from the time we built our first stone axes to the invention of the latest computer.

The easiest way to get a grasp on the "accumulating information in an object" is via comparing "apple" that is product of a tree, and "Apple" product from Silicon valley. The former is a product available in the nature and we internalize in our minds while the latter is an instantiation of the knowledge in our head. Both products are packets of information, but only the latter is a crystal of imagination. The author cites two examples of MIT lab scientists who are working on robotic arms and optogenetics. They are trying to create objects that crystallize imagination, and by doing so, they are endowing our species with new capacities. The author gives several contexts where thinking about products in a different way changes several preexisting metrics and notions that we carry on in our head. For example, Chile is a potential exporter of copper and one might argue that other countries are exploiting Chile. However by looking at the value generated in the finished products that use copper, the value of copper itself goes up. So, who is exploiting whom? Isn’t Chile free-riding on the crystallized imagination of other people?

Thinking about products as crystals of imagination helps us understand the importance of the source of the information that is embodied in a product. Complex products are not just arrangements of atoms that perform functions; rather, they are ordered arrangements of atoms that originated as imagination.


The chapter is titled so, to emphasize the amplifying nature of the objects. Each object can be thought of as a crystallization of knowledge and knowhow and these objects become important to all of us because they enhance our capacities to do other things with it. Take laptop for instance. It is a product of someone else’s imagination and we get to use it to produce some other objects. There is no need to know what’s behind the hood for every object that we use. In the words of the author,

Products are magical because they augment our capacities

Objects are much more than merely a form of communication.

Our ability to crystallize imagination into products, although expressive, is different from our ability to verbally articulate ideas. An important difference is that products can augment our capacities in ways that narrative descriptions cannot. Talking about toothpaste does not help you clean your teeth, just as talking about the chemistry of gasoline will not fill up your car with gas. It is the toothpaste’s embodiment of the practical uses of knowledge, knowhow, and imagination, not a narrative description of them, that endows other people with those practical uses. Without this physical embodiment the practical uses of knowledge and knowhow cannot be transmitted. Crystallizing imagination is therefore essential for sharing the practical uses of the knowledge that we accumulate in our mind. Without our ability to crystallize imagination, the practical uses of knowledge would not exist, because that practicality does not reside solely in the idea but hinges on the tangibility of the implementation. Once again, the physicality of products-whether tangible or digital-augments us.

The main idea of this chapter is to describe products as physical embodiments of information, carrying the practical uses of knowledge, knowhow, and imagination. Our capacity to create products that augment us also helps define the overall complexity of our society.

This time, It’s personal

If we look at various products, the knowledge and knowhow for creating these products are geographically biased, though it is coming down a bit at least on the software front. The reason for this geographical bias is that crystallization of any product requires a great amount of knowledge and knowhow. The learning in almost all the cases is experimental and social. Bookish knowledge alone is not enough. You need a certain set of environment where you can interact, share ideas, experiment, learn from trial and errors. Each geographical region has its own idiosyncrasies and hence gives rise to different codifications of knowledge and knowhow. So, this means that there is certainly going to be geographical bias in the products we see. So, this naturally limits the growth of information. The author introduces a term, person-byte, meaning maximum knowledge and knowhow carrying capacity of a human. Is there a limit for human knowledge? Well, let’s talk about knowledge that one can accumulate over a period of ones working life. If I take my own example, there is a limit to how much math you can do, what kind of stats I can work on, what kind of models that I can build, the amount of code I can write. All these ultimately limit the growth of information. In that sense, a person-byte is a nifty idea that says that for information to grow, there needs to be a network of people where the collective person-bytes of the group is more than the individual person-byte.

The person-byte limit implies that the accumulation of large amounts of knowledge and knowhow are limited by both the individual constraints of social and experiential learning and the collective constraints brought by our need to chop up large volumes of knowledge and knowhow and distribute them in networks of individuals.

Links are not free

If one harks back to the time Henry Ford’s Model-T factory, it was considered as a poster child of industrial economy. It stood for the efficiency gained through scale. The output of the factory, the car, was a complex product and the rationale was, it was better to chunk out this complex task in to 7,882 tasks. It is another matter of debate whether there was a need for 7,882 individual tasks or not. One takeaway could be that complex products needs giant factories. Based on that takeaway, we should be having innumerable giant factories, given the complexity of products that we see in today’s world. This is where the author introduces a second level of quantization of knowledge and knowhow; firm-byte. This is a conceptual term that gives a upper limit on the amount of knowledge and knowhow a firm can possess. So, if a product requires more number of firm-bytes, there is a need for a network of firms. The factors that limit the size of the firm has been studied under "transaction cost theory" extensively. The author gives an overview of the theory that says

There are fundamental forces that limit the size of the networks we know as firms, and hence that there is a limit to the knowledge and knowhow these networks can accumulate. Moreover, it also tells us that there is a fundamental relationship between the cost of the links and the size of these networks: the cheaper the link, the larger the network.

It all comes down to links. If you take a typical Barbie doll, the various steps in the start to scratch process happen in twenty different countries. What has made possible this splintering up of the manufacturing process? It is not because the product is complicated. It is because the cost of creating a links between a set of firms has become easy. This could be attributed to reducing transportation costs, revolution in communication technologies, standardization of parts etc. In all the cases where market links have become cheaper, we have seen vast networks of firms participating together. There are innumerable examples that fall in to this category(iPad, iPhone,laptops,cell phones,…)

Does it mean that making the cost of market links cheaper will automatically give rise to increase in information via crystallization of many other products? Not necessarily. We observe links that are inherently expensive depending on the frequency and specificity of the transaction.

In Links We Trust

This chapter explores the role of "trust" in formation of networks. Social networks and social institutions help determine the size, adaptability, and composition of the networks humans need to accumulate knowledge and knowhow. When it comes to size, the ability of societies to grow large networks is connected to the level of trust of the underlying society. When it comes to the composition of networks, social institutions and preexisting social networks affect the composition of the professional networks we form in two important ways. On the one hand, a society’s level of trust determines whether networks are more likely to piggyback on family relations. On the other hand, people find work through personal contacts, and firms tend to hire individuals who trace the social networks of their employees.

The takeaway from this chapter is that social networks and institutions are also known to affect the adaptability of firms and networks of firms.

The Evolution of Economic Complexity

If one’s intention were to study the geographical distribution of knowledge and knowhow, one inevitably comes up with an issue- knowledge and knowhow are intangibles. How does one cull out of these things for various geographies ? The author’s first attempt is to look at the location of various industries that produce complex objects to simple objects. In this context, he uses the concept of "nestedness" from ecology and does number crunching to show that

There is a clear trend showing that the most complex products tend to be produced in a few diverse countries, while simpler products tend to be produced in most countries, including those that produce only a handful of products. This is consistent with the idea that industries present everywhere are those that require less knowledge and knowhow.

The author ties back his person-byte theory to the observations from the data. In a sense, the inference is commonsensical. The knowledge and knowhow of specialized products is sticky and biased towards specific geographical areas where or ubiquitous products, the knowledge and knowhow is spread across a wide range of geographies.

The Sixth Substance

If one looks at the models describing economic growth, the five common factors used in the literature are

  1. Land
  2. Labor
  3. Physical Capital
  4. Human Capital
  5. Social Capital

The author connects these five factors to the principles explained in the previous chapters. For example, the physical capital is the physical embodiment of information that carries the practical uses of the knowledge and knowhow used in their creation. Physical capital is made of embodied information and it is equivalent to the crystals of imagination described in the previous chapters. The author introduces a metric "economic complexity" that takes in to consideration diversity of exporting country, diversity of the country to which export is being made, the ubiquity of the product exported. The author tests his model for predictive accuracy and shows that it performs well.


The last section of the book highlight the main points from the book. In a sense, it makes my summary redundant as the author provides a far more concise summary. So, if you are short on time, you might just want to go over the last 10 pages of the book.


There is no denying about the importance of Silence and Solitude in one’s life. For me, they have always provided an appropriate environment to learn and understand a few things deeply. Drawing from that experience, I strongly feel one should actively seek some amount of "silent time" in one’s life. Is it difficult for a person leading a married life, to carve out spaces of silence? Not necessarily. I remember reading a book by Anne D.LeClaire, in which the author writes about her experience of remaining completely silent on the first and third Mondays of every month. Anne explains that this simple practice brought tremendous amount of calmness in her family life. The family members unknowingly start giving importance to "pauses", the "pauses" that actually make the sentence meaningful, the "pauses" that make the music enjoyable, the "pauses" that make our lives meaningful. Indeed many have written about the transformative experience of silence. But how many of us consciously seek silence and more importantly incorporate in our daily lives ? In the hubbub of our lives and in our over-enthusiasm for acquiring/reaching/grabbing something that is primarily externally gratifying, we often turn our back on "silence" and consequently deny or at least partly deny those experiences that are internally gratifying.

I picked up this book almost 2 months ago. For various reasons, it remained in my inventory for quite some without me having a peaceful go at it. In mid-December 2015, after living in Mumbai for 6.5 years, I decided to leave Mumbai for personal reasons. I had spent the first few weeks of December shipping most of my stuff and vacating the rented flat. Those weeks were undeniably very exhausting as I had a ton of books that had to be sorted, categorized and shipped to different places. Once I had shipped everything, the house was literally empty. Except for a few clothes of mine, and my Sitar, the house was totally empty and silent. For some reason, I felt totally liberating in that empty and silent house. In that context, I set out to read this book. In this post, I will briefly summarize the main points from the book.


The author starts off by talking about the excessive importance we give to material comforts and affective concerns

In our daily lives many of us spend most of our time looking for comforts-material comforts and affective comforts-in order to merely survive. That takes all our time. These are what we might call the daily concerns. We are preoccupied with our daily concerns: how to have enough money, food, shelter, and other material things. We also have affective concerns: whether or not some particular person loves us, whether or not our job is secure. We worry all day because of those kinds of questions. We may be trying to find a relationship that is good enough to endure, one that is not too difficult. We’re looking for something to rely on.

We may be spending 99.9 percent of our time worrying about these daily concerns-material comforts and affective concerns-and that is understandable, because we need to have our basic needs met to feel safe. But many of us worry far, far beyond having our needs met. We are physically safe, our hunger is satisfied, we have a roof over our heads, and we have a loving family; and still we can worry constantly.

The deepest concern in you, as in many of us, is one you may not have perceived, one you may not have heard. Every one of us has an ultimate concern that has nothing to do with material or affective concerns. What do we want to do with our life? That is the question. We are here, but why are we here? Who are we, each of us individually? What do we want to do with our life? These are questions that we don’t typically have (or make) the time to answer.

These are not just philosophical questions. If we’re not able to answer them, then we don’t have peace-and we don’t have joy, because no joy is possible without some peace. Many of us feel we can never answer these questions. But with mindfulness, you can hear their response yourself, when you have some silence within.

What we all need is "silence" to tune in to ourselves.

A Steady diet of noise

This chapter is mainly about realizing the kind of noise that pervades our minds. Cows, goats and buffalo chew their food, swallow it, then regurgitate and rechew it multiple times. We may not be cows or buffalo, but we ruminate just the same on our thoughts – unfortunately, primarily negative thoughts. We eat them, and then we bring them up to chew again and again, like a cow chewing its cud. The author calls this incessant noise, NST (Non Stop Thinking) Radio Station. Unconsciously many of us are constantly listening to NST and do not take time out to truly listen to what our heart needs. To understand the kind of thoughts that we constantly consume via NST, the author classifies them as follows :

  1. Edible food: What we eat affects how we feel. Imagine for a second that you overeat something you like. It is similar to a seizure where you cannot control yourself and give in to it. The immediate feeling after this overdose is usually laziness, boredom and a dull brain. You try concentrating on a thing and you realize it becomes difficult. So, something as elementary as edible food that is necessary for our survival becomes nourishing or toxic, depending on what we consume, how much we consume, and how aware we are of our consumption.
  2. Sensory food: Sensory food is what we take in with our senses and our mind – everything we see, smell, touch, taste and hear. This type of food has a far more influence on how we feel. Some of us are forever open to this external world. All the windows and doors are eternally open to this external world which throws a barrage of sensory stimuli. Most of the sensory food we consume is useless at best, harmful at worst. How often we keep watching a pathetic TV program and still lack the power to shut it off ? We often become paralyzed to sensory foods and become slaves to it. Kids are often introduced video games by parents and then the sensory food that kids derive is so addictive that kids start behaving like the characters in the video game. Imagine a kid who plays a game involving violence; Do you think he will generate calmness and a sense of balance in his mind ? No way. Same is the case with conversations. Suppose you talk to a person who is full of bitterness, envy or craving. During the conversation, you take in the person’s energy of despair. Even though you had no ill-feelings in your mind to begin with, you mind will be infected with such feelings as you begin conversing with such people. One easy way to avoid such a situation is to leave such a company and go else where. If you are in a situation where you are forced to be in such a company for whatever reasons, the next best thing is to be aware of the kind of thoughts the other person is emanating. Awareness makes you immune to the toxic sensory food that you come across in your daily life.
  3. Volition: Our primary intention and motivation is another kind of food. It feeds us and gives us purpose. Like the previous kinds of food, it can be extremely nourishing or extremely toxic based on the kind of intent and motivation levels. So much of the noise around us, whether advertisements, movies, games, music, or conversation, gives us messages about what we should be doing, what we should look like, what success looks like, and who we should be. Because of all this noise, it’s rare that we pay attention to our true desire. We act, but we don’t have the space or quiet to act with intention. If what you are doing is what your heart truly desires, then the associated work becomes a bliss. You don’t have to bother how other people look at your behavior, action and work. As long as you are clear that it is what you truly enjoy and desire, there will be little chance for toxic thoughts to arise. What you truly want to do, is something that is not that easy to figure out, if you are continuously tuned out. It requires some time out from the rat race, a period of solitude that gives you space to understand yourself.
  4. Individual and Collective consciousness: Even if we go on a sensory fast, we still feed our thoughts from our consciousness. The best way to describe this is to think of it has two storeyed building. We are forever planting seeds in the lower storey of the house. The seeds could be pleasant ones or unpleasant ones. These seeds are being watered constantly by us. Which of these seeds we water, is dependent on our individual and collective consciousness. If we are in a toxic environment, automatically, without our notice, we water crappy seeds and they show their colors with vengeance, making us feel unpleasant about it. As they say, even if you take a person to Himalayas, you cannot the take the person away from him. By consciously choosing what and who you surround yourself with, is among the keys to finding more space for joy

The takeaway from this chapter is that, we need to be aware of the kind of foods we are taking in. By being aware, it is likely that toxic foods do not enter us. By being aware, it is likely that we entertain healthy foods in to us, thus turning us in to a peaceful and a wholesome person.

Radio Non Stop Thinking

The antidote to NST is mindfulness. By having mindfulness in all the activities you perform, you will be able to stop the NST radio in your head. Shifting our attention away from our thoughts to what’s really happening in the present moment is a basic practice of mindfulness. We can do it anytime, anywhere, and find more pleasure in life. Whether we’re cooking, working, brushing our teeth, washing our clothes, or eating, we can enjoy this refreshing silencing of our thoughts and our speech. Mindfulness entails finding the inner quietness.


Thundering Silence

Many times we consume different kinds of foods mentioned in the previous chapters as a response to the compulsive urge to avoid ourselves. Whenever we try to confront the unpleasant part of us, we know that we should be letting it go. But we hold on to it. Our "knowing that we should let it go" and we actually letting it go, are two vastly different things. The latter requires us to remain silent and go to the very source of it, acknowledge it, appreciate it for whatever it has taught in our life and then let it go. With out this phase of "examining it in silence", we will forever be trying to let it go but never actually letting it go.

What is the essence of stillness ?

When we release our ideas, thoughts, and concepts, we make space for our true mind. Our true mind is silent of all words and all notions, and is so much vaster than limited mental constructs. Only when the ocean is calm and quiet can we see the moon reflected in it. Silence is ultimately something that comes from the heart, not from any set of conditions outside us. Living from a place of silence doesn’t mean never talking, never engaging or doing things; it simply means that we are not disturbed inside; there isn’t constant internal chatter. If we’re truly silent, then no matter what situation we find ourselves in, we can enjoy the sweet spaciousness of silence.

The chapter’s title is "thunderous silence", a kind of silence that is  opposite to oppressive silence

Suppose you sit outside and pay attention to the sunshine, the beautiful trees, the grass, and the little flowers that are springing up everywhere. If you relax on the grass and breathe quietly, you can hear the sound of the birds, the music of the wind playing in the trees. Even if you are in a city, you can hear the songs of the birds and the wind. If you know how to quiet your churning thoughts, you don’t have to turn to mindless consumption in a futile attempt to escape from uncomfortable feelings. You can just hear a sound, and listen deeply, and enjoy that sound. There is peace and joy in your listening, and your silence is an empowered silence. That kind of
silence is dynamic and constructive. It’s not the kind of silence that represses you. In Buddhism we call this kind of silence thundering silence. It’s very eloquent, and full of energy.

The author ends the chapter with a few simple exercises that one can perform anywhere to renew yourself and energize yourself.

The Power of Stillness

The author says, we rarely notice our breathing patterns and rarely do we enjoy our breathing. Some people carry around a notion that one has to add an additional item, "Meditation" in to their agenda. However it is far easier than that. All one has to do for practicing mindfulness is to reorient yourself and remember your true intention. Quiet, mindful breathing is something you can do at any time. Wherever you are can be a sacred place, if you are there in a relaxed and serene way, following your breathing and keeping your concentration on whatever you’re doing. The simple process of sitting quietly on a regular basis can be profoundly healing. The author offers a simple exercise for beginners; dedicate five minutes every day to walking quietly and mindfully. I guess the proof of the pudding is in the eating. You can try it out and see if it calms down your mind and make you more aware of the thoughts, feelings and NST radio in your head.

Paying Attention

One of the often asked questions involves the relevance of following mindfulness during tasks that are inherently banal. The author says that by practicing mindfulness at all times, it becomes easier to access our "island of self" during the times we actually need it. A related idea is developing the capacity to be alone or being in solitude. There are two dimensions to solitude. The first is to be alone physically. The second is to be able to be yourself and stay centered even in the midst of a group. The former appears easy and latter appears difficult, though it might be appear vice-versa for a different person. Paying attention towards anything necessitates that one is comfortable with solitude, for great technologies, ideas, inventions are a result of paying deep attention on something and then actualizing the imagination in to the real world.

Cultivating Connection

One of the ways to cultivate connection with others is to listen deeply. What does it mean to listen deeply? It basically entails stopping the NST radio, being silent and truly listening to others without forming any judgment. Sometimes if you are lucky, you will befriend a person whose company you can enjoy even without talking. The mere presence of that person who is silent can make your joyful. The author says that two people being together in silence is a very beautiful way to live. Solitude is not found only by being alone in a hut deep in the forest; it is not about cutting ourselves off from civilization. We do not lose ourselves; we do not lose our mindfulness; Taking refuge in our mindful breathing, coming back to the present moment, is to take refuge in the beautiful, serene island that each of us has within. If we carve out little moments of spaciousness in the various activities of our lives for this kind of quiet, we open ourselves up to the ultimate freedom. Whoever be the person you are trying to make connection with,friend or sibling or parent or relative or colleague, spending time in silence together is one of the best ways to forge long lasting relationships.


There are many other little hacks that the author dishes out for practicing mindfulness. Some of them are

  1. Digital Nirvana for a day
  2. Try to remain silent during a specific time period of the day or the week. I remember reading a wonderful book, titled, "Listening below the noise", in which the author follows "Silent Monday" ritual in her family. The book goes on to show innumerable benefits of this simple ritual to all her family members
  3. Some tool/gadget that reminds you to concentrate back on the present. In this context, I find Pomodoro technique to be a very effective mechanism to create a sequence of focused and relaxed times.

takeawayTakeaway : 

Irrespective of whether you take "pauses" in your life or not, I think it might be a good thing to take a "pause" and read this book. If not anything, you could find some useful hacks to lead a peaceful life.


The book titled, "The Golden ticket", gives a non-technical introduction to the most important unsolved problem in the computer science,"Whether P = NP?". If you are curious about knowing this problem and do not want to slog through the dense math behind it, then this book might be a perfect introduction to the problem. The author makes sure that the concepts are introduced using plain English and nothing more. By the time you are done with the book, I bet you will start appreciating a few if not all the NP-complete problems that are yet to be cracked. In this post, I will try to summarize briefly the main points of the book.


The author starts off by stating the P vs. NP challenge as one of seven recognized Millennium prize problems by the Clay Mathematical Institute. P refers to the class of problems that can be solved efficiently using computers. Efficient here means using a polynomial time algorithm. NP refers to the class of problems to which we would like to find the best solution. Another way to state NP class of problems is the set of problems for which a we can efficiently check its solution. So, the million dollar question is, whether the set of all problems that can be solved efficiently and the set of problems that for which we can efficiently check its solution are one and the same. For someone who is coming across this for the first time, it might be a bit puzzling. Why is "having a quick algo for verifying a solution?" for a problem important ? What’s the connection between an algo used for verifying a solution to a problem and an algo used for solving the problem? This question is answered elaborately in the chapters to follow.

The Golden Ticket

The author clarifies the intent behind using the phrase, "The Golden Ticket", in the title of the chapter. It acts as a cue to the story of a candy manufacturer who places a handful of golden tickets inside chocolate bars, hidden among tens of millions of bars produced each year. The finders of the golden tickets will get a free factory tour. If you want to desperately find the golden ticket, How do you find one ? No matter how you find the ticket, you will need a considerable amount of time, money and luck. This is a nice example to have in mind as it reminds the user that there are problems that we encounter in our lives for which the solutions take considerable amount of time, no matter how much computing power you have.

The chapter introduces the traveling salesman problem, a problem that involves selecting the best route to minimize the total distance travelled. The problem looks deceptively simple. In fact for 48 city tour, if you intend to check all the tour combinations on a laptop, it will take 10 trillion trillion times the current age of the universe. There are many problems such as TSP that fall under NP category, i.e the class of problems for which we want to find the solution. P refers to the class of problems for which one can find an efficient solution. P=NP means we can quickly find a solution to all the NP problems. P ≠ NP means we can’t. Do we come across NP problems in real world or Are they just a figment of imagination for a pure mathematician ? Well, NP-complete problems are everywhere. For example, when you plug in your start and end points in a GPS system, the possible combinations of routes are too many for one to do a exhaustive search. Instead the GPS finds the best route heuristically.

The chapter ends with a list of problems that the author promises to answer via this book, i.e.

  1. What are P and NP?
  2. What kinds of problems do they capture?
  3. What are NP-complete problems, the hardest of all search problems?
  4. How do these problems capture the P versus NP question?
  5. How did we get to P and NP?
  6. How do we approach a proof that P ≠ NP ?
  7. Kurt Gödel showed that mathematics cannot solve every problem. Can similar techniques show that there are search problems we can’t solve quickly?
  8. What good can come out of P ≠ NP?
  9. Can future computers based on quantum mechanics make the P vs.NP problem irrelevant?
  10. What about the future? The great challenges of computing still lie ahead of us. a) How do we deal with computers that have to work together to solve problems? b) How do we analyze the massive amount of data we generate every day? c) What will the world look like when we network everything?

The Beautiful World

The author gives a futuristic tour of various domains, if P=NP were to be proved conclusively. There are examples on the brighter side as well as the darker side. Cancers can be cured, Sports as we know it will get completely revolutionized, Politics and the whole activity of campaigning will turn in to a game of algos, Police would be crack cases in days what used to take months and years. However the dark side of it is that machines will become so damn powerful that there will be privacy issues, public key-private key encryption will be falter and ecommerce will fall back to a more primitive encryption, more people will lose jobs as machines become efficient in doing cognitive jobs too. These scenarios help a reader to visualize the future if ever P=NP were to be proved.

P and NP

The author uses an example of a fictional land,"Frenemy" to illustrate various types of problems that fall under class P. What does Frenemy look like?

Frenemy has about 20,000 inhabitants. Every individual seems normal, but put two in close vicinity and a strange thing happens. Either the two take an instant liking to each other, immediately becoming the best of friends, or they take one look at each other and immediately become the worst of enemies. Despite the Frenemy name, two inhabitants are never observed taking a middle ground; they are always either close friends or distant enemies.

If one were to check whether six degrees of freedom holds good for any two individuals in Frenemy by brute force, i.e. counting all the possible links of six links, this problem will take more than 100 years to compute. There are simple algos that make such tasks computationally tractable. To find the lengths of separations between all pairs of inhabitants of Frenemy, one can use Floyd-Warshall algorithm, which will compute all these distances in about eight trillion computation steps. Given the modern day computing power of about billion operations per second, this can be done pretty quickly. The author uses various types of examples using Frenemy citizens to illustrate various problems and algos.

  1. Claude Berge Algorithm
  2. Jack Edmonds Algorithm – In a paper, he was the first to use the term,"algebraic time" to capture the intuitive idea of efficiency of an algorithm. Later the term was replaced by "polynomial time". The class of problems that can be solved in "polynomial time" is referred to as P.
  3. No mathematically rigorous proof that shows a map can be colored using four colors.
  4. Algorithm for finding Eulerian path
  5. Min-cut Problem in graph theory
  6. No polynomial time algo for finding Cliques
  7. No polynomial time for finding Hamiltonian Cycles
  8. No polynomial time for Max-cut problem
  9. No polynomial time for coloring maps

The last four of the above problems, as they stand today, have no polynomial time algo to find the solution. However given a potential solution, you can easily check whether it is indeed a solution or not. The class of problems such as these is called NP.

There are a ton of problems for which we do not know how to find the solutions quickly. The author goes on to give a list of NP problems in various fields

  1. Genomics – Current DNA sequencing technologies can only sequence about 1000 base pairs at a time and shortest chromosome has 47 million base pairs. How best to put these pairs together and get a full sequence is an NP problem
  2. How does a protein fold based on mRNA sequence is also a NP hard problem
  3. Optimization problems can be NP hard problems. Finding the minimum energy of a physical system can be an NP hard problem
  4. There are many NP problems in game theory

The Hardest Problems in NP

A logic expression is satisfiable if there is someway to find a solution that satisfies the logic expression. If there is a quick algorithm for solving the satisfiable problem, then by "reduction", you can use the same algorithm for solving the original problem. The author shows the clique problem can be reduced to satisfiability problem. It was Steve Cook, an University of Toronto professor who showed that "Satisfiability" is a good candidate for an interesting problem not efficiently computable. Even though Cook was the first to notice the importance of the problem, it was Richard Karp, a Berkley professor who converted the satisfiability problem to clique problem. Cook and Karp’s work meant that satisfiability for a clique is easy to compute if and only if clique was easy to compute. Thanks to the connection between satisfiability and solvability, there were now many different problems that are equivalent in terms of computational hardness. Solve any of the problem efficiently, and you have the solution to all the problems, i.e P=NP. These problems did not emanate from some pure mathematician’s head but arose from the practical world. It was Donald Knuth who ran a poll to figure out a name for the set of problems in NP that are hardest and the poll suggested the name, "NP complete".

The author gives the following examples from the real world that are NP-complete:

  1. Scheduling operations in a Coke plant
  2. How to go for a Disney tour so that you spend minimum amount of time ?
  3. Figuring out the dominating set in a graph
  4. Large Sudoko games
  5. Minesweeper game
  6. Tetris
  7. Graph Isomorphism
  8. Factoring a prime number
  9. Integer Linear Programming problems

The Prehistory of P versus NP

Another way to verbalize P=NP is : Do you need a brute force search to solve a problem or do you have a smart algorithm for solving the problem ? This chapter gives the historical background to the P versus NP problem. There are two strands of developments mentioned, one in the Russian camp and other the Western nations camp. The following personalities figure on the Western nations camp :

  • Warren McCulloch
  • Walter Pitts
  • Juris Hartmanis and Richard Stearns
  • Jack Edmonds
  • Alan Cobham
  • Steve Cook
  • Richard Karp

And on the Russian side,the following spearheaded the effort:

  • Sergey Yablonsky, who first focused on perebor for finding the smallest circuits for computing functions but whose power and arrogance worked against the development of computational complexity in Russia.
  • Andrey Kolmogorov, the great Russian mathematician, who measured complexity via algorithmic information.
  • Kolmogorov’s student Leonid Levin, who independently developed the P versus NP problem and the theory of NP-completeness.

Dealing with Hardness

NP-complete problems are nasty creatures. Should one give up whenever one stumbles on to a variant of NP-complete problem. Not necessarily. This chapter gives a list of alternative approaches that can be used to solve moderate sized problems.

  • Brute force : Thanks to Moore’s law, the computing power of today’s computers help one use a brute force approach to find the solution. For example TSP on 1000 cities can be solved using today’s technology
  • Heuristics
  • Search for a smaller version of the problem
  • Approximate solutions

There are many other problems though, that do not yield to any of the above hacks and they are many out there that are NP-hard.

Proving P ≠ NP

This chapter provides a list of approaches that various people have taken to prove that P is not equal to NP

  • Using Paradoxes : The author lists "I am liar" paradox to show that in mathematics, there is a certain appeal to using paradoxes. Godel upended the foundations of mathematics by showing that there are true statements that one cannot prove as true. Halting problem is another type of problem that can be understood via paradox. There simply can never be any program that can tell whether another program finishes. However the author shows that using this approach to prove P is not equal to NP isn’t going to work.
  • Circuits : Every IC in a computer has transistors and each of these transistors implement logic gates. Every function, no matter how complicated, can be computed by a circuit of AND, OR, and NOT gates. What’s the connection with circuits ? Any problem that has computationally efficient solution can have a reasonably small. There have been many efforts by various people to use circuits to prove or disprove N=NP, but none have stood the test of rigor.

Where are we in the efforts to proving P vs. NP?

The only known serious approach to the P versus NP problem today is due to Ketan Mulmuley from the University of Chicago. He has shown how solving some difficult problems in a mathematical field called algebraic geometry (considerably more complex than high school algebra and geometry) may lead to a proof that P ≠ NP. But resolving these algebraic geometry problems will require mathematical techniques far beyond what we have available today. A few years ago Mulmuley felt this program would take 100 years to play out. Now he says that is a highly optimistic point of view. How long until we see an actual resolution of the P versus NP problem? Maybe the problem will have been resolved by the time you read this. Far more likely the problem will remain unsolved for a very long time, perhaps longer than the 357 years needed for Fermat’s Last Theorem, or perhaps it will forever remain one of the true great mysteries of mathematics and science.


This chapter gives a brief introduction to cryptography and in doing so, shows how much of the modern cryptography relies on P ≠ NP. It starts off by giving a 10000 ft. view of public key and private key cryptography. One cannot help but notice that the fact that since factoring is assumed to be NP hard, today we have a convenient secured communication. Every time you see a "https" in your browser URL, it is the public key-private key in action. In the future, if someone were to prove P= NP, the modern cryptography would fall apart. So, does it signify "End of the World" ? Not really. There are techniques like one-time pads that can be used to create secure communication channels, the downside though is that the method of providing these channels become painful.

The author gives an example of Sudoko to illustrate the concept of "zero knowledge proof" Sudoku is NP-complete, so every NP problem can be reduced to Sudoku. So we automatically get a zero-knowledge system for every NP problem. In a common attack in cryptography, a person can cheat by pretending to be someone he or she isn’t. Zero-knowledge proofs give a method to guard against such identity spoofing. There is also a brief mention about the fully homomorphic encryption and Pseudorandom number generators for creating secure communication.


This chapter gives a bird’s eye view on Quantum computing and its relevance to the P vs. NP problem. Unlike the traditional bit that is used in the computers, quantum computing uses something called "qubit". The chapter gives a range of potential applications that can make many NP problems tractable. However the author is cautious about the time it might take for us to see real applications of quantum computing.

Full-scale quantum computers will remain special-purpose machines, useful for factoring  numbers or simulating quantum systems. They may help us break codes and help us better understand the fundamental nature of the universe, but they likely won’t solve NP-complete problems or make spreadsheets run faster.

The Future

By the time a reader hits the last chapter, it is clear that the author has a bleak outlook for the P vs. NP problem. However he says that the problem gives a common framework to solve great computational tasks that lay ahead of us.

  • Parallel computation: Computers used to double in speed every eighteen to twenty-four months, but now we are hitting physical limits that prevent much faster processors in the future. Instead, computers are multiplying, and we can have several processors working together on a chip or in a cloud. How can we adjust our algorithms to work in this growing parallel world? The class of the problems that can be solved by parallel computing is called NC. So, the mystery deepens. Is P=NC ? Is NC = NP?
  • Big data: From the Internet to scientific experiments and simulations, we create huge amounts of data every day. How do we reason about, make sense of, learn from, and develop predictions from this massive amount of information?
  • Networking of everything: Much of the world lives on some computer network, whether they use some social network like Facebook or just communicate via email. Soon nearly everything man-made will also be part of this network, from the clothes we wear to the light bulbs we read by. How do we make the best use of this ultraconnected world?

The author concludes the book by saying,

Computation is about process, and not just on computers. The P versus NP problem has to do with the limits on nature itself, on how biological and physical systems evolve, and even on the power of our own minds. As long as P versus NP remains a mystery we do not know what we cannot do, and that’s liberating.


The book is more than just describing a particular unsolved problem in computer science. It takes a whirlwind tour of many problems across several domains and gives the reader a beautiful perspective about what could happen if ever the P vs. NP problem were to solved. Even if the problem is not solved for another century, the framework that can be obtained by thinking through the problem would make one’s mind richer.