This book gives a macro picture of machine learning. In this post, I will briefly summarize the main points of the book. One can think of this post as a meta-summary, as the book itself is a summary of all the main areas of machine learning.



Machine learning is all around us, embedded in technologies and devices that we use in our daily lives. These systems are so integrated with our lives that we often do not even pause to appreciate their power. Well, whether we appreciate it or not, there are companies harnessing the power of ML and profiting from it. So the question arises: do we need to care about ML at all? When a technology becomes this pervasive, you need to understand it at least a little. You can’t control what you don’t understand. Hence, at least from that perspective, having a general overview of the technologies involved matters.

Machine learning is something new under the sun: a technology that builds itself. The artifact in ML is referred to as a learning algorithm. Humans have always designed artifacts, whether hand built or mass produced. But learning algorithms are artifacts that design other artifacts. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer’s needs. But instead of turning stone into masonry or gold into jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.

At its core, Machine learning is about prediction: predicting what we want, the results of our actions, how to achieve our goals, how the world will change.

The author says that he has two goals in writing this book:

  • Provide a conceptual model of the field, i.e. rough knowledge so that you can use it effectively. There are many learning algorithms out there and many more are invented every year. The book provides an overview of learning algos by categorizing the people who use them. The author calls each category a tribe. Each tribe has its own master algorithm for prediction:
    • Symbolists: They view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic.
      • Master algorithm for this tribe is inverse deduction
    • Connectionists: They reverse engineer the brain and are inspired by neuroscience and physics.
      • Master algorithm for this tribe is backpropagation
    • Evolutionaries: They simulate evolution on the computer and draw on genetics and evolutionary biology.
      • Master algorithm for this tribe is genetic programming
    • Bayesians: They believe that learning is a form of probabilistic inference and have their roots in statistics.
      • Master algorithm for this tribe is Bayesian inference
    • Analogizers: They learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
      • Master algorithm for this tribe is Support vector machines.

In practice, each of the master algorithms that a particular tribe uses is good for some types of problems but not for others. What we really want is a single algorithm combining the key features of all of them: The Master Algorithm.

  • Enable the reader to invent the master algorithm. A layman, approaching the forest from a distance, is in some ways better placed than the specialist, already deeply immersed in the study of particular trees. The author suggests that the reader pay attention to each tribe and get a general overview of what each tribe does and what tools each tribe uses. By viewing each tribe as a piece of a puzzle, it becomes easier to gain conceptual clarity on the field as a whole.

The Machine Learning Revolution

An algorithm is a sequence of instructions telling a computer what to do. The simplest algorithm is: flip a switch. The second simplest algorithm is: combine two bits. The idea connecting transistors and reasoning was understood by Claude Shannon, whose master's thesis showed that Boolean logic could be implemented with electrical circuits; he later went on to found information theory. Computers are all about logic: flipping sets of transistors is all any algorithm ultimately does. However, behind this benign activity, some of the most powerful algorithms do their work by using preexisting algos as building blocks. So, if computing power increases, do all the algos automatically become efficient and all-powerful? Not necessarily. The serpent in the garden goes by the name of the "complexity monster", and this monster has many heads:

  • Space complexity : the number of bits of info that an algo needs to store in the computer’s memory
  • Time complexity : how long the algo takes to run
  • Human complexity : when algos become too complex, humans can no longer understand or control them, and errors in their execution cause panic.
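The first two heads can be made concrete by counting steps. A toy sketch (the function names and step-counting are mine, for illustration): a linear scan takes time proportional to the input size, while binary search on sorted data halves the problem each step.

```python
def linear_search(xs, target):
    """Scan left to right; the worst case touches every element (O(n) time)."""
    steps = 0
    for i, x in enumerate(xs):
        steps += 1
        if x == target:
            return i, steps
    return -1, steps

def binary_search(xs, target):
    """Halve the search range each step (O(log n) time) on sorted input."""
    lo, hi, steps = 0, len(xs) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid, steps
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

data = list(range(1024))
_, lin = linear_search(data, 1023)   # worst case for the linear scan
_, bnr = binary_search(data, 1023)   # logarithmic number of probes
```

On 1024 elements the scan takes 1024 steps while binary search needs only 11, which is the whole point of taming the time-complexity head with a better algorithm.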

Every algorithm has an input and an output – the data goes into the computer, the algo does its job, and out comes the result. Machine learning turns this around: in go the data and the desired result, and out comes the algorithm that turns one into the other. Learning algorithms – learners – are algorithms that make other algorithms. With machine learning, computers write their own programs. The author uses a nice low-tech analogy to explain the power of machine learning:
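That inversion is easy to show in miniature. A hypothetical one-parameter learner: instead of writing the rule ourselves, we hand in input-output pairs and get the rule back.

```python
# Conventional programming: we write the rule ourselves.
def double(x):
    return 2 * x

# Machine learning: in go the data and the desired results,
# out comes the rule (here, a one-parameter least-squares fit).
def learn_scale(pairs):
    """Find w minimizing sum((w*x - y)^2) over (x, y) pairs."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

examples = [(1, 2), (2, 4), (3, 6)]   # inputs and desired outputs
w = learn_scale(examples)             # the learner recovers "multiply by 2"
```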

Humans have known a way to get some of the things they need by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. The promise of machine learning is that the technology can mirror farming. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

This analogy makes two things immediate. First, the more data we have, the more we can learn. Second, ML is a sword that can slay the complexity monster. Broadly, a learning algorithm does one of two things: it learns knowledge, in the form of statistical models, or it learns the procedures that underlie a skill. The author offers another analogy, from the information ecosystem. He likens databases, crawlers, indexers and so on to herbivores, patiently munching on endless fields of data. Statistical algos and online analytical processing are the predators, and learning algorithms are the superpredators. Predators turn data into information, and superpredators turn information into knowledge.

Should one spend years in the computer science discipline to become an ML expert? Not necessarily. CS makes you think deterministically, whereas ML needs probabilistic thinking. The difference in thinking is a large part of why Microsoft has had a lot more trouble catching up with Google than it did with Netscape. A browser is just a standard piece of software, but a search engine requires a different mind-set.

If you look at all the fantastic stuff behind e-commerce applications, it is all about matchmaking: producers of information are being connected to consumers of information. In this context, learning algorithms are the matchmakers. The arsenal of learning algos will serve as a key differentiator between companies. Data is indeed the new oil.

The chapter ends with a discussion of how political campaigning is being revolutionized by ML. With truckloads of data at the individual voter level, politicians are turning to ML experts to help them with campaign analysis and targeted outreach. The author predicts that in the future, machine learning will cause more elections to be close. What he means is that learning algorithms will be the ultimate retail politicians.


The Master Algorithm

If you look at algorithms such as nearest neighbor, decision trees, or naive Bayes, they are all domain agnostic. This warrants a question: "Can there be one learner that does everything, across all domains?" If you estimate a parameter the frequentist way and then check it with Bayesian inference, both approaches give almost the same answer when there is a LOT of data. Under a deluge of data, you will most certainly get convergence of the parameter estimates, largely irrespective of the model used. In reality, however, data is scarce, and that scarcity necessitates making assumptions. Depending on the type of problem and the type of assumption, certain classes of learning models are better than the rest. If one speculates about the presence of a master algorithm, then the assumptions also need to go in as input. This chapter explores the possibility of a master algorithm that can learn from input data, output data and assumptions. The author grandly states the central hypothesis of the book as:

All knowledge-past, present, and future-can be derived from data by a single, universal learning algorithm.
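The earlier claim that frequentist and Bayesian estimates converge under a deluge of data can be checked numerically. A Beta-Binomial coin model (the prior and the counts are arbitrary choices of mine):

```python
def estimates(heads, n, a=5, b=5):
    """Frequentist MLE (heads/n) vs posterior mean under a Beta(a, b) prior."""
    mle = heads / n
    posterior_mean = (heads + a) / (n + a + b)
    return mle, posterior_mean

# Little data: the prior pulls the Bayesian estimate noticeably toward 0.5.
mle_small, bayes_small = estimates(7, 10)
# Lots of data: the two estimates all but coincide.
mle_big, bayes_big = estimates(7000, 10000)
```

With 10 flips the two answers differ by a tenth; with 10,000 flips the gap all but vanishes, regardless of the prior.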

The author speculates about the existence of a master algorithm and notes that arguments from many fields point toward one.

The argument from Neuroscience : The evidence from the brain suggests that it uses the same learning algorithm, throughout, with the areas dedicated to the different senses distinguished only by the different inputs they are connected to. In turn, the associative areas acquire their function by being connected to multiple sensory regions, and the "executive" areas acquire theirs by connecting the associative areas and motor output.

If you look at any part of the cortex in a human brain, you find that the wiring pattern is similar. The cortex is organized into six layers, with feedback loops, short-range inhibitory connections and long-range excitatory connections. This pattern is common across the brain; there is some variation, but the structure is pretty much the same. The analogy is that it is the same algo with different parameters and settings. Low-level motor skills are controlled by the cerebellum, which has a clearly different and regular architecture. However, experiments have shown that the cortex can take over the cerebellum's functions. So this again suggests that there is some "master algorithm" underlying the entire brain.

If you look at the set of learning algos in the ML field, you can see that at some level many of them try to reverse engineer the brain's function. One of the five ML tribes, the Connectionists, believes in this way of modeling the world.

The argument from Evolution : If one looks at the evolution of species since the beginning of life on earth, one can think of natural selection as an algorithm. This algorithm takes the existing species as input and produces the species at any given point in time as output. The master algo does the work of eliminating certain species, allowing certain species to mutate, and so on. It is a dynamic process where the outputs are fed back in as inputs. This line of thought makes one speculate about the presence of a master algorithm. In fact, one of the five tribes in the ML world, the Evolutionaries, strongly believes in this way of modeling.

The argument from Physics : Most of physics is driven by simple equations that prune away all the noise in the data and focus on the underlying beauty. Physical laws discovered in one domain are seamlessly applied to other domains. If everything we see in nature can be explained by a few simple laws, then it makes sense that a single algorithm can induce all that can be induced. All the Master Algorithm has to do is provide a shortcut to the laws' consequences, replacing impossibly long mathematical derivations with much shorter ones based on actual observations. Another way to look at scientific disciplines is to think of their various laws and states as outcomes of a dynamic optimization problem. Physics, however, is unique in its simplicity; in messier fields, mathematics is only reasonably effective.

Machine learning is what you get when the unreasonable effectiveness of mathematics meets the unreasonable effectiveness of data.

The argument from Statistics : Bayesians look at the world with a probabilistic, learning mindset. Bayes' rule is a recipe for turning data into knowledge. In earlier years, Bayesian methods were confined to simple problems. However, with the rise of computing power, they are now applied to a wide range of complex modeling situations. Is Bayes the master algorithm? Well, there are many critics of the Bayesian approach to modeling. All said and done, it definitely appears that Bayesian inference will be part of the "Master Algorithm" in some way.
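Bayes' rule as a data-to-knowledge recipe, in one toy calculation (all the probabilities below are made up for illustration): seeing a suspicious word in an e-mail updates our belief that it is spam.

```python
# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.2                  # prior: 20% of mail is spam (assumed)
p_word_given_spam = 0.6       # the word appears in 60% of spam (assumed)
p_word_given_ham = 0.05       # ...and in 5% of legitimate mail (assumed)

# Total probability of seeing the word at all.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: belief after seeing the evidence.
p_spam_given_word = p_word_given_spam * p_spam / p_word
```

One observed word lifts the spam probability from the 20% prior to a 75% posterior; that update, repeated over evidence, is the Bayesian engine.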

The argument from computer science : The author mentions the famous unsolved problem in computer science, P vs. NP. The NP-complete problems are a set of problems that are equivalent in their computational hardness: if you solve one of them efficiently, you solve them all. For decades, mathematicians, researchers and scientists have found clever tricks to approximately solve NP-complete problems, but the fundamental question still eludes us: is the class of problems we can solve efficiently the same as the class of problems whose solutions we can check efficiently? If you read up on this problem, you will realize that every NP-complete problem can be reduced to the satisfiability problem. If we invent a learner that can learn to solve satisfiability, it has a good claim to being the Master Algorithm.
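Satisfiability itself is easy to state in code; what is hard is doing better than brute force. A sketch (the clause encoding is my own convention): each clause is a list of integers, where `2` means variable x2 and `-2` means NOT x2.

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Brute-force SAT: try all 2^n assignments. Checking one assignment
    is fast (the NP part); finding one is what blows up exponentially."""
    for bits in product([False, True], repeat=n_vars):
        # A clause like [1, -2] means (x1 OR NOT x2); variables are 1-indexed.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# (x1 OR x2) AND (NOT x1 OR x2) AND (x1 OR NOT x2): satisfied by x1=x2=True.
sat = satisfiable([[1, 2], [-1, 2], [1, -2]], 2)
# (x1) AND (NOT x1): a contradiction, never satisfiable.
unsat = satisfiable([[1], [-1]], 1)
```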

NP-completeness aside, the sheer fact that a computer can do a gazillion tasks should make one confident about speculating the presence of a master algorithm that works across many problems. The author uses the example of the Turing machine: back then, it was unthinkable to actually see one in action, yet a Turing machine can solve every conceivable problem that is solvable by logical deduction. The fact that we now see these machines everywhere means that, despite the odds, we might see a Master Algorithm sometime in the future.

The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.

Interesting analogy : The author is of the opinion that "human intuition" can't replace data. There have been many instances where human intuition has gone terribly wrong and someone armed with lots of data has done better. The author uses the work of Brahe, Kepler and Newton to draw a parallel to machine learning.

Science goes through three phases, which we can call the Brahe, Kepler, and Newton phases. In the Brahe phase, we gather lots of data, like Tycho Brahe patiently recording the positions of the planets night after night, year after year. In the Kepler phase, we fit empirical laws to the data, like Kepler did to the planets’ motions. In the Newton phase, we discover the deeper truths. Most science consists of Brahe-and-Kepler-like work; Newton moments are rare. Today, big data does the work of billions of Brahes, and machine learning the work of millions of Keplers. If-let’s hope so-there are more Newton moments to be had, they are as likely to come from tomorrow’s learning algorithms as from tomorrow’s even more overwhelmed scientists, or at least from a combination of the two.

Critics of Master Algo : Well, for a concept as ambitious as the Master Algo, there are bound to be critics, and there are many. The author mentions a few as examples:

  • Knowledge engineers
  • Marvin Minsky
  • Noam Chomsky
  • Jerry Fodor

Hedgehog or Fox : Another question that comes up when thinking about the "Master Algorithm" is whether it is a fox or a hedgehog. Many studies have shown that being a fox is far better than being a hedgehog; the hedgehog is synonymous with the narrow "expert". In the context of this book, though, a learning algorithm can be considered a "hedgehog" if variations of it can solve all learning problems. The author hopes that the "Master Algorithm" turns out to be a hedgehog.

Five tribes of ML : In the quest for the master algorithm, we do not have to start from scratch. There are already many decades of ML research, and each research community is akin to a tribe. The author describes five tribes: symbolists, connectionists, Bayesians, evolutionaries and analogizers. Each tribe uses its own master algo. Here is an illustration from the author's presentation:



But the real master algo that the author is hinting at is one that combines the features of all the tribes.

The most striking feature of each tribe is that its members firmly believe that theirs is the only way to model and predict. Unfortunately, this thinking hinders their ability to model a broad set of problems. For example, a Bayesian would find it extremely difficult to leave the probabilistic inference method and look at a problem from an evolutionary point of view; his thinking is forged in priors, posteriors and likelihood functions. If a Bayesian were to look at an evolutionary method like a genetic algo, he might not critically analyze it and adapt it to the problem at hand. This limitation is prevalent across all tribes. Analogizers love support vector machines, but they are limited because they look for similarities of inputs across various dimensions, and hence are bound to be hit by the curse of dimensionality. That curse comes back to bite each tribe in its own way, depending on the type of problem being solved.

The obvious question that arises in a reader's mind is: can tribes be combined to solve a specific set of problems? Indeed, the tribe categorization is not a hard partition of the algorithms. It is just meant as a starting point so that you can place the gamut of algos in separate buckets.


Hume’s problem of Induction

The chapter starts with a discussion of "Rationalism vs. Empiricism". The rationalist likes to plan everything in advance before making the first move; the empiricist prefers to try things and see how they turn out. There are philosophers who strongly believe in one and not the other. From a practical standpoint, both camps have made productive contributions to our world. David Hume is considered one of the greatest empiricists of all time. In the context of machine learning, one of his questions has hung like a sword of Damocles over all of knowledge:

How can we ever be justified in generalizing from what we’ve seen to what we haven’t?

The author uses a simple example where you have to decide whether or not to ask someone out for a date. The dataset used in the example illustrates Hume's problem of induction, i.e. there is no reason to pick one generalization over another. So a safe way out of the problem, at least to begin with, is to assume that the future will be like the past. Is this enough? Not really. In the ML context, the real problem is: how do we generalize to cases that we haven't seen before? One might think that amassing huge datasets solves the problem. However, once you do the math, you realize that no dataset covers all the cases needed to carry the inductive argument safely. Each new data point is most likely unique, and you have no choice but to generalize. According to Hume, there is no justified way to do it.

If this all sounds a bit abstract, suppose you’re a major e-mail provider, and you need to label each incoming e-mail as spam or not spam. You may have a database of a trillion past e-mails, each already labeled as spam or not, but that won’t save you, since the chances that every new e-mail will be an exact copy of a previous one are just about zero. You have no choice but to try to figure out at a more general level what distinguishes spam from non-spam. And, according to Hume, there’s no way to do that.

The "no free lunch" theorem : If you have been reading some general articles in the media on ML and big data, it is likely that you would have come across a view on the following lines:

With enough data, ML can churn out the best learning algo. You don’t have to have strong priors, the fact that you have large data is going to give you all the power to understand and model the world.

The author introduces David Wolpert's "no free lunch" theorem, which places a limit on how good a learning algorithm can be. The theorem says that, averaged over all possible worlds, no learner can be better than random guessing. Are you surprised by this theorem? Here is how one can reconcile oneself to it:

Pick your favorite learner. For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount. All I have to do is flip the labels of all unseen instances. Since the labels of the observed ones agree, there’s no way your learner can distinguish between the world and the antiworld. On average over the two, it’s as good as random guessing. And therefore, on average over all possible worlds, pairing each world with its antiworld, your learner is equivalent to flipping coins.
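The world/antiworld argument can be simulated directly. A toy sketch (data and learner are both made up; the learner just predicts the majority training label):

```python
def majority_learner(train):
    """Predict the majority label seen in training, everywhere."""
    ones = sum(y for _, y in train)
    return 1 if ones * 2 >= len(train) else 0

# Observed instances: identical in the world and the antiworld.
train = [(0, 1), (1, 1), (2, 0)]
unseen = [3, 4, 5, 6]
world = {3: 1, 4: 1, 5: 0, 6: 1}
antiworld = {x: 1 - y for x, y in world.items()}   # flip only unseen labels

pred = majority_learner(train)   # predicts 1 for every instance
acc_world = sum(pred == world[x] for x in unseen) / len(unseen)
acc_anti = sum(pred == antiworld[x] for x in unseen) / len(unseen)
```

The learner scores 75% in one world and 25% in its antiworld; averaged over the pair it is exactly a coin flip, just as the theorem says.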

How do we escape this random-guessing limit? By caring only about the world we live in, not about alternate worlds. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. What are the implications of the "no free lunch" theorem for our modeling world?

There’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.

Unwritten rule of Machine learning : The author states that a principle laid out by Newton in his "Principia" serves as the first unwritten rule of ML:

Whatever is true of everything we’ve seen is true of everything in the universe.

Newton's principle is only the first step, however. We still need to figure out what is true of everything we've seen - how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner's job is to flesh it out. One way to create such a form is via "conjunctive concepts", i.e. a series of conditions joined by AND. The problem with conjunctive concepts is that on their own they are practically useless. The real world is driven by "disjunctive concepts", i.e. concepts defined by a set of rules. One of the pioneers of this rule-discovery approach was Ryszard Michalski, a Polish computer scientist. After immigrating to the United States in 1970, he went on to found the symbolist school of machine learning, along with Tom Mitchell and Jaime Carbonell.
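A minimal sketch of learning a conjunctive concept, in the Find-S style of keeping the most specific hypothesis consistent with the positive examples (the attribute names and data are invented for illustration):

```python
def learn_conjunction(positives):
    """Most-specific conjunction consistent with the positive examples:
    keep an attribute=value condition only if every positive agrees on it."""
    hypothesis = dict(positives[0])
    for example in positives[1:]:
        for attr in list(hypothesis):
            if example.get(attr) != hypothesis[attr]:
                del hypothesis[attr]    # generalize: drop the condition
    return hypothesis

# Hypothetical positive examples of the concept "a good day".
yes_days = [
    {"weekend": True, "warm": True, "on_tv": False},
    {"weekend": True, "warm": False, "on_tv": False},
]
rule = learn_conjunction(yes_days)   # learned: weekend AND NOT on_tv
```

The learner drops `warm` because the positives disagree on it, leaving the conjunction `weekend AND NOT on_tv`; the brittleness of such rules is exactly why disjunctions of rules are needed.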

Overfitting and Underfitting : The author uses the words "blindness" and "hallucination" to describe underfitting and overfitting. With a huge hypothesis space, you can almost certainly overfit the data; with too sparse a hypothesis set, you can fail to see the true patterns in the data. This classic problem is kept in check by out-of-sample testing. Is that good enough? Well, it's the best that is available without going into muddy philosophical debates or more pessimistic frameworks like that of Leslie Valiant (author of Probably Approximately Correct).
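The hallucination/blindness trade-off can be demonstrated with a toy experiment (all data synthetic, thresholds arbitrary): a learner that memorizes its training set is perfect in-sample but collapses out of sample, while a sparse one-threshold rule holds up.

```python
import random

random.seed(1)

def make_data(n):
    """True concept: y = 1 iff x > 0.5, with 15% label noise."""
    return [(x, int((x > 0.5) ^ (random.random() < 0.15)))
            for x in (random.random() for _ in range(n))]

train, test = make_data(200), make_data(200)

memorizer = dict(train)            # overfitting: store every training example
def memo_predict(x):
    return memorizer.get(x, 0)     # unseen points: just guess 0

def simple_predict(x):
    return int(x > 0.5)            # sparse hypothesis: a single threshold

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

train_memo = accuracy(memo_predict, train)     # perfect on what it saw
test_memo = accuracy(memo_predict, test)       # near chance out of sample
test_simple = accuracy(simple_predict, test)   # limited only by label noise
```

Out-of-sample testing is precisely what exposes the memorizer: its training score says nothing about the test score, while the simple rule's scores agree.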

Induction as the inverse of deduction : Symbolists work via the induction route and formulate elaborate sets of rules. Since this route is computationally intensive for large datasets, symbolists prefer something like decision trees. Decision trees can be viewed as an answer to the question of what to do when rules of more than one concept match an instance: how do we then decide which concept the instance belongs to?

Decision trees are used in many different fields. In machine learning, they grew out of work in psychology. Earl Hunt and colleagues used them in the 1960s to model how humans acquire new concepts, and one of Hunt's graduate students, J. Ross Quinlan, later tried using them for chess. His original goal was to predict the outcome of king-rook versus king-knight endgames from the board positions. From those humble beginnings, decision trees have grown to be, according to surveys, the most widely used machine-learning algorithm. It's not hard to see why: they're easy to understand, fast to learn, and usually quite accurate without too much tweaking. Quinlan is the most prominent researcher in the symbolist school. An unflappable, down-to-earth Australian, he made decision trees the gold standard in classification by dint of relentlessly improving them year after year, and writing beautifully clear papers about them. Whatever you want to predict, there's a good chance someone has used a decision tree for it.
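The heart of a Quinlan-style tree learner is choosing the split that most reduces entropy. A minimal sketch (toy data, invented attribute names; a full learner would recurse on each branch):

```python
from math import log2

def entropy(labels):
    """Impurity of a label list, in bits."""
    total = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum(c / total * log2(c / total) for c in counts.values())

def best_attribute(rows, attrs):
    """Pick the attribute whose split gives the lowest weighted entropy,
    i.e. the highest information gain - the core choice in ID3/C4.5."""
    def split_entropy(attr):
        values = set(r[attr] for r in rows)
        return sum(
            (len(sub) / len(rows)) * entropy([r["label"] for r in sub])
            for v in values
            for sub in [[r for r in rows if r[attr] == v]]
        )
    return min(attrs, key=split_entropy)

rows = [
    {"outlook": "sunny", "windy": False, "label": "play"},
    {"outlook": "sunny", "windy": True, "label": "play"},
    {"outlook": "rainy", "windy": False, "label": "stay"},
    {"outlook": "rainy", "windy": True, "label": "stay"},
]
root = best_attribute(rows, ["outlook", "windy"])
```

Splitting on `outlook` makes both branches pure (zero entropy) while `windy` tells us nothing, so `outlook` becomes the root of the tree.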

The Symbolists : The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate.

Symbolist machine learning is an offshoot of the knowledge engineering school of AI. The use of computers to automatically learn rules made the work of pioneers like Ryszard Michalski, Tom Mitchell, and Ross Quinlan extremely popular, and since then the field has exploded.

What are the shortcomings of inverse deduction?

  • The number of possible inductions is vast, and unless we stay close to our initial knowledge, it’s easy to get lost in space
  • Inverse deduction is easily confused by noise
  • Real concepts can seldom be concisely defined by a set of rules. They’re not black and white: there’s a large gray area between, say, spam and nonspam. They require weighing and accumulating weak evidence until a clear picture emerges. Diagnosing an illness involves giving more weight to some symptoms than others, and being OK with incomplete evidence. No one has ever succeeded in learning a set of rules that will recognize a cat by looking at the pixels in an image, and probably no one ever will.

An interesting symbolist success story is Eve, the robot scientist that discovered a potential malaria drug. There was a flurry of excitement a year ago when an article titled "Robot Scientist Discovers Potential Malaria Drug" was published in Scientific American. This is the kind of learning that symbolists are gung-ho about.


How does your brain learn ?

This chapter covers the second of the five tribes mentioned in the book, the "Connectionists". Connectionists are highly critical of the way Symbolists work, as they think that describing something via a set of rules is just the tip of the iceberg; there is a lot more going on under the surface that formal reasoning can't see. Say you come across the word "love": a Symbolist would associate a rule with the concept, whereas a Connectionist would associate various parts of the brain with it. There is no one-to-one correspondence between concepts and symbols; the correspondence is many-to-many. Each concept is represented by many neurons, and each neuron participates in representing many different concepts. Hebb's rule is the cornerstone of connectionism. In a non-math way, it says that "neurons that fire together wire together". The other big difference between Symbolists and Connectionists is that the former believe in sequential processing whereas the latter believe in parallel processing.
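Hebb's rule can be sketched in a few lines: strengthen a connection in proportion to the joint activity of the units it links (the weight matrix, activities and learning rate below are all arbitrary toy values):

```python
def hebbian_update(w, pre, post, rate=0.1):
    """'Fire together, wire together': increase w[i][j] in proportion to
    the joint activity of presynaptic unit i and postsynaptic unit j."""
    return [[w[i][j] + rate * pre[i] * post[j]
             for j in range(len(post))] for i in range(len(pre))]

w = [[0.0, 0.0], [0.0, 0.0]]   # 2 input units fully connected to 2 outputs
pre, post = [1, 0], [1, 1]     # only the first input unit fires
for _ in range(5):
    w = hebbian_update(w, pre, post)
# Connections from the active unit strengthen; the silent unit's stay at zero.
```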

To get a basic understanding of the key algos used by connectionists, it helps to know how a neuron is structured in our brain. Here is a visual that I picked up from the author's presentation:


The branches of the neuron connect to others via synapses, and basic learning takes place via changes in these synaptic connections. The first formal model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943. It looked a lot like the logic gates computers are made of. The problem with this model was that it did not learn. It was Frank Rosenblatt who came up with the first learning model, the perceptron, by giving variable weights to the connections between neurons. The following is a good schematic diagram of the perceptron:


This model generated a lot of excitement, and ML received a lot of funding for various research projects. However, the excitement was short lived. Marvin Minsky and Seymour Papert published many examples of things the perceptron could not learn; the simplest and most damaging was the XOR operator. Their critique was mathematically unimpeachable, searing in its clarity, and disastrous in its effects. Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.
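The XOR failure is easy to reproduce. Below is a minimal perceptron trained with Rosenblatt's update rule (toy encoding, arbitrary learning rate): it masters AND, which is linearly separable, but no setting of a single neuron's weights can ever represent XOR.

```python
def train_perceptron(data, epochs=20, rate=0.1):
    """Rosenblatt's rule: nudge the weights toward each misclassified example."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = int(w[0] * x[0] + w[1] * x[1] + b > 0)
            err = y - pred
            w = [w[0] + rate * err * x[0], w[1] + rate * err * x[1]]
            b += rate * err
    return lambda x: int(w[0] * x[0] + w[1] * x[1] + b > 0)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

f = train_perceptron(AND)
and_ok = all(f(x) == y for x, y in AND)   # separable: fully learned
g = train_perceptron(XOR)
xor_ok = all(g(x) == y for x, y in XOR)   # not separable: always fails
```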

Fast forward to John Hopfield's work on spin glasses, which led to a reincarnation of the perceptron:

Hopfield noticed an interesting similarity between spin glasses and neural networks: an electron's spin responds to the behavior of its neighbors much like a neuron does. In the electron's case, it flips up if the weighted sum of the neighbors exceeds a threshold and flips (or stays) down otherwise. Inspired by this, he defined a type of neural network that evolves over time in the same way that a spin glass does and postulated that the network's minimum energy states are its memories. Each such state has a "basin of attraction" of initial states that converge to it, and in this way the network can do pattern recognition: for example, if one of the memories is the pattern of black-and-white pixels formed by the digit nine and the network sees a distorted nine, it will converge to the "ideal" one and thereby recognize it. Suddenly, a vast body of physical theory was applicable to machine learning, and a flood of statistical physicists poured into the field, helping it break out of the local minimum it had been stuck in.
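A miniature version of Hopfield's idea, with a single stored pattern (the pattern itself is arbitrary): Hebbian weights, then repeated thresholded updates that pull a corrupted input back into the memory's basin of attraction.

```python
def sign(x):
    return 1 if x >= 0 else -1

def hopfield_weights(patterns):
    """Hebbian storage: w[i][j] = sum over patterns of p[i]*p[j], no self-loops."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def recall(w, state, steps=5):
    """Repeatedly settle each unit toward lower energy until it converges."""
    n = len(state)
    for _ in range(steps):
        state = [sign(sum(w[i][j] * state[j] for j in range(n)))
                 for i in range(n)]
    return state

memory = [1, 1, -1, -1, 1, -1]    # one stored pattern of +/-1 "pixels"
w = hopfield_weights([memory])
noisy = [1, -1, -1, -1, 1, -1]    # the same pattern with one bit flipped
recovered = recall(w, noisy)      # converges back to the stored memory
```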

The author goes on to describe the "sigmoid" function and its ubiquitous nature. If you think about the curve for some time, you will find it everywhere. I think the first time I came across this function was in Charles Handy's book, "The Age of Paradox". Sigmoid curves in that book describe phenomena that grow slowly at first, then explosively, and then level off. If you take the first derivative of the sigmoid function, you get the classic bell curve. I think "The Age of Paradox" had a chapter with some heavy management gyan along the lines of "you need to start another sigmoid curve in your life before the old one begins its downturn", or something to that effect. I don't quite recollect the exact idea from Charles Handy's book, but there is a blog post by Bret Simmons, titled The Road to Davy's Bar, that goes into related details.

Well, in the context of ML, the application of the sigmoid curve is more practical: it can replace the step function, and suddenly things become more tractable. A single neuron can only learn a straight line, but a set of neurons, i.e. a multilayer perceptron, can learn far more convoluted boundaries. Agreed, there is a curse of dimensionality here, but the hyperspace explosion is a double-edged sword: on the one hand the objective function is far more wiggly, but on the other hand there is less chance that gradient search will get stuck at a local minimum. With the sigmoid and multilayer tweaks, the perceptron came back with a vengeance, generating as much excitement as when it was first introduced. The algorithm by which the learning takes place is called "backpropagation", popularized by David Rumelhart and colleagues in 1986. It is a variant of gradient descent, though there is no mathematical proof that backpropagation will find the global optimum. Backprop solves what the author calls the "credit assignment" problem: in a multilayer perceptron, the error between the target value and the current output must be propagated backward through all layers, assigning each layer's weights their share of the blame.

Whenever the learner’s "retina" sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two.
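The forward-and-backward pass described above can be sketched in a few lines of Python. This is a toy 2-2-1 network learning the OR function (my own toy setup, not an example from the book); the point is only to show the credit-assignment step, where the output error is pushed back to the hidden layer:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# A tiny 2-2-1 network; biases are folded in as a constant extra input of 1.0.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # learn OR

def forward(x):
    inp = x + [1.0]  # inputs plus bias
    h = [sigmoid(sum(w * i for w, i in zip(ws, inp))) for ws in w_hidden]
    hb = h + [1.0]   # hidden activations plus bias
    o = sigmoid(sum(w * i for w, i in zip(w_out, hb)))
    return inp, hb, o

def total_error():
    return sum((t - forward(x)[2]) ** 2 for x, t in data)

before = total_error()
lr = 0.5
for _ in range(2000):
    for x, t in data:
        inp, hb, o = forward(x)
        # Credit assignment: output error first, then propagate it back
        # through the output weights to each hidden neuron.
        delta_o = (o - t) * o * (1 - o)
        delta_h = [delta_o * w_out[j] * hb[j] * (1 - hb[j]) for j in range(2)]
        for j in range(3):
            w_out[j] -= lr * delta_o * hb[j]
        for j in range(2):
            for k in range(3):
                w_hidden[j][k] -= lr * delta_h[j] * inp[k]
after = total_error()
print(before, after)  # the error shrinks as the weights converge
```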

Sadly, the excitement phase petered out, as learning in networks with dozens or hundreds of hidden layers was computationally difficult. In recent years though, backpropagation has made a comeback thanks to huge computing power and big data. It now goes by the name "Deep Learning". The key idea of deep learning is based on autoencoders, which the author explains very well. However, there are many things that need to be worked out for deep learning to be anywhere close to the Master Algorithm. All said and done, there are a few limitations to exclusively following the connectionist tribe. Firstly, the learning algo is difficult to comprehend; it comprises convoluted connections between various neurons. The other limitation is that the approach is not compositional, meaning it is divorced from the way a big part of human cognition works.


Evolution : Nature’s Learning Algorithm

The chapter starts with the story of John Holland, the first person to have earned a PhD in computer science, in 1959. Holland is known for his immense contribution to genetic algorithms. His key insight lay in coming up with a fitness function that would assign a score to every program considered. What’s the role of the fitness function? Starting with a population of not-very-fit individuals-possibly completely random ones-the genetic algorithm has to come up with variations that can then be selected according to fitness. How does nature do that? This is where the genetic part of the algorithm comes in. In the same way that DNA encodes an organism as a sequence of base pairs, we can encode a program as a string of bits. Variations are produced by crossovers and mutations. The next breakthrough in the field came from Holland’s student John Koza, who came up with the idea of genetic programming: evolving full-blown computer programs.
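A minimal sketch of these ingredients in Python, assuming a toy "count the 1-bits" fitness function (the standard OneMax exercise, not an example from the book). Selection keeps the fittest individuals, and crossover and mutation produce the variations:

```python
import random

random.seed(42)
LENGTH = 20

def fitness(bits):
    # Toy objective: the more 1-bits, the fitter the individual.
    return sum(bits)

def crossover(a, b):
    # Splice two parents at a random point, like DNA recombination.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rate=0.01):
    # Flip each bit with a small probability.
    return [1 - b if random.random() < rate else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]  # selection according to fitness
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    pop = parents + children

best = max(pop, key=fitness)
print(fitness(best))  # climbs toward the maximum of 20
```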

Genetic programming’s first success, in 1995, was in designing electronic circuits. Starting with a pile of electronic components such as transistors, resistors, and capacitors, Koza’s system reinvented a previously patented design for a low-pass filter, a circuit that can be used for things like enhancing the bass on a dance-music track. Since then he’s made a sport of reinventing patented devices, turning them out by the dozen. The next milestone came in 2005, when the US Patent and Trademark Office awarded a patent to a genetically designed factory optimization system. If the Turing test had been to fool a patent examiner instead of a conversationalist, then January 25, 2005, would have been a date for the history books. Koza’s confidence stands out even in a field not known for its shrinking violets. He sees genetic programming as an invention machine, a silicon Edison for the twenty-first century.

A great mystery in genetic programming that is yet to be solved conclusively is the role of crossover. None of Holland’s theoretical results show that crossover actually helps; mutation suffices to exponentially increase the frequency of the fittest schemas in the population over time. There were other problems with genetic programming that finally made the ML community at large divorce itself from this tribe.

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy. As in the nature versus nurture debate, neither side has the whole answer; the key is figuring out how to combine the two. The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. So, is this it? Have we stumbled onto the right path for the "Master Algorithm"? Not quite. There are tons of problems with evolutionary algos. Symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do. If we want to learn to diagnose cancer, for example, it’s not enough to say "this is how nature learns; let’s do the same." There’s too much at stake. Errors cost lives. Symbolists dominated the first few decades of cognitive psychology. In the 1980s and 1990s, connectionists held sway, but now Bayesians are on the rise.


In the Church of the Reverend Bayes

Persi Diaconis, in his paper titled MCMC Revolution, says that the MCMC technique that came from the Bayesian tribe has revolutionized applied mathematics. Indeed, thanks to high performance computing, Bayes is now a standard tool in any number cruncher’s tool kit. This chapter talks about various types of Bayesian techniques. The basic idea behind Bayes is that it is a systematic and quantified way of updating degrees of belief in the light of new data. You can cast pretty much any problem, irrespective of the size of the data available, into a Bayesian inference problem. Bayes’ theorem usually goes by the name "inverse probability" because in real life we know Pr(effect|cause) and we are looking to compute Pr(cause|effect). Bayes’ theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy. The main point of conflict between Bayesians and non-Bayesians is the reliance on subjective priors to "turn the Bayesian crank". Using subjective estimates as probabilities is considered a sin by Frequentists, for whom everything should be learned from the data.
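The inverse-probability flavor of Bayes’ theorem is easy to see with a toy calculation. The numbers below are hypothetical (a made-up medical test, not an example from the book): we know Pr(positive test | disease) and want Pr(disease | positive test):

```python
# Hypothetical inputs: what we actually know.
p_disease = 0.01           # prior Pr(cause)
p_pos_given_disease = 0.9  # Pr(effect | cause): the test's sensitivity
p_pos_given_healthy = 0.05 # false positive rate

# Total probability of the effect, then Bayes' theorem to invert it.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # about 0.154: low, despite a "90% accurate" test
```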

One of the most common variants of Bayesian models is the "Naive Bayes" model, where each cause is independent of the other causes in creating an effect. Even though this assumption sounds extremely crazy, there are a ton of areas where Naive Bayes beats sophisticated models. No one is sure who invented the Naive Bayes algorithm. It was mentioned without attribution in a 1973 pattern recognition textbook, but it only took off in the 1990s, when researchers noticed that, surprisingly, it was often more accurate than much more sophisticated learners. Also, if you reflect a bit, you will realize that Naive Bayes is closely related to the Perceptron algorithm.
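A from-scratch sketch of the naive independence assumption, on a made-up toy dataset (my own illustration): the score for each class is the prior times the per-feature conditional probabilities, multiplied as if the features were independent.

```python
from collections import defaultdict

# Toy data: (features, label). Features are assumed independent given the label.
data = [
    (("sunny", "hot"), "play"),
    (("sunny", "mild"), "play"),
    (("rainy", "mild"), "play"),
    (("rainy", "hot"), "stay"),
    (("rainy", "hot"), "stay"),
]

label_counts = defaultdict(int)
feature_counts = defaultdict(int)  # (position, value, label) -> count
for features, label in data:
    label_counts[label] += 1
    for i, v in enumerate(features):
        feature_counts[(i, v, label)] += 1

def predict(features):
    best, best_score = None, 0.0
    for label, n in label_counts.items():
        score = n / len(data)  # the prior Pr(label)
        for i, v in enumerate(features):
            score *= feature_counts[(i, v, label)] / n  # naive independence
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("sunny", "hot")), predict(("rainy", "hot")))
```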

The author mentions Markov models as the next step in the evolution of Bayesian models. Markov models apply to a family of random variables where each variable depends only on the current state and is conditionally independent of the rest of its history. Markov chains turn up everywhere and are one of the most intensively studied topics in mathematics, but they’re still a very limited kind of probabilistic model. A more complicated model is the Hidden Markov Model, where we don’t get to see the actual states but have to infer them from the observations. A continuous version of the HMM goes under the name "Kalman filter" and has been used in many applications across domains.
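A tiny sketch of the Markov property, using a hypothetical two-state weather chain (my own numbers): tomorrow depends only on today, and iterating the transition probabilities drives the distribution to its stationary values.

```python
# Hypothetical transition probabilities: tomorrow depends only on today.
P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

dist = {"sunny": 1.0, "rainy": 0.0}  # start: definitely sunny
for _ in range(50):
    # One step of the chain: redistribute probability along the transitions.
    dist = {s: sum(dist[prev] * P[prev][s] for prev in P) for s in P}

# Converges to the stationary distribution (2/3 sunny, 1/3 rainy here).
print(round(dist["sunny"], 3), round(dist["rainy"], 3))
```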

Naive Bayes, Markov models and Hidden Markov Models are all good, but they are all a far cry from what the Symbolists can represent. The next breakthrough came from Judea Pearl, who invented Bayesian networks. These allow one to specify complex dependencies among random variables. By defining the conditional independence of a variable given a set of neighboring nodes, Bayesian networks tame the combinatorial explosion and make inference tractable. Basically, a Bayesian network can be thought of as a "generative model", a recipe for probabilistically generating a state of the world. Despite the complex nature of a Bayesian net, the author mentions that techniques have been developed to successfully infer various aspects of the network. In this context, the author mentions MCMC and gives an intuitive explanation of the technique. A misconception amongst many is that MCMC is a simulation technique. Far from it: the procedure does not simulate any real process; rather, it is an efficient way to generate samples from a Bayesian network. Inference in Bayesian networks is not limited to computing probabilities. It also includes finding the most probable explanation for the evidence. The author uses the "poster child" example of inferring the probability of heads from coin tosses to illustrate the Bayesian technique and compare it with the Frequentist world of inference.
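Here is a toy Metropolis sampler for that coin-toss example (my own sketch, assuming a flat prior on the coin’s bias; clipping proposals at the boundary is a small approximation to a proper reflecting proposal). The Markov chain wanders over possible biases, visiting each in proportion to its posterior probability:

```python
import random

random.seed(0)
heads, tosses = 7, 10  # observed data: 7 heads in 10 tosses

def likelihood(theta):
    # Probability of the observed tosses given bias theta (flat prior assumed).
    return theta ** heads * (1 - theta) ** (tosses - heads)

samples, theta = [], 0.5
for _ in range(20000):
    # Propose a small random move and accept it with the Metropolis ratio.
    proposal = min(1.0, max(0.0, theta + random.uniform(-0.1, 0.1)))
    if random.random() < likelihood(proposal) / likelihood(theta):
        theta = proposal
    samples.append(theta)

# Discard burn-in, then average; the analytic posterior mean is 8/12 = 0.667.
posterior_mean = sum(samples[5000:]) / len(samples[5000:])
print(round(posterior_mean, 2))
```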

The next set of models that came to dominate the Bayesian tribe is Markov Networks. A Markov network is a set of features and corresponding weights, which together define a probability distribution. Like Bayesian networks, Markov networks can be represented by graphs, but they have undirected arcs instead of arrows. Markov networks are a staple in many areas, such as computer vision. There are many who feel that Markov networks are far better than Naive Bayes, HMMs etc., as they can capture the influence from surroundings.

Bayesians and symbolists agree that prior assumptions are inevitable, but they differ in the kinds of prior knowledge they allow. For Bayesians, knowledge goes in the prior distribution over the structure and parameters of the model. In principle, the parameter prior could be anything we please, but ironically, Bayesians tend to choose uninformative priors (like assigning the same probability to all hypotheses) because they’re easier to compute with. For structure, Bayesian networks provide an intuitive way to incorporate knowledge: draw an arrow from A to B if you think that A directly causes B. But symbolists are much more flexible: you can provide as prior knowledge to your learner anything you can encode in logic, and practically anything can be encoded in logic-provided it’s black and white.

Clearly, we need both logic and probability. Curing cancer is a good example. A Bayesian network can model a single aspect of how cells function, like gene regulation or protein folding, but only logic can put all the pieces together into a coherent picture. On the other hand, logic can’t deal with incomplete or noisy information, which is pervasive in experimental biology, but Bayesian networks can handle it with aplomb. Combining connectionism and evolutionism was fairly easy: just evolve the network structure and learn the parameters by backpropagation. But unifying logic and probability is a much harder problem.


You are what you resemble

The author introduces the techniques of the "Analogizers" tribe. This tribe uses similarities among various data points to categorize them into distinct classes. In some sense, we all learn by analogy. Every example that illustrates an abstract concept is like an analogy. We learn by relating the similarity between two concepts and then figure out what else one can infer based on the fact that the two concepts are similar.

The chapter begins with explaining the most popular algorithm of the tribe, "the nearest neighbor algorithm". This was invented way back in 1951 by Evelyn Fix and Joe Hodges. The inventors faced a massive difficulty in publishing their algorithm. However the fact that the algo remained unpublished did not faze many researchers who went about developing variants of the algorithm like "K nearest neighbor" method, "Weighted K-nearest neighbor" etc. It was in 1967 that Tom Cover and Peter Hart proved that, given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier. This was a momentous revelation. Up until then, all known classifiers assumed that the frontier had a very specific form, typically a straight line. This was a double-edged sword: on the one hand, it made proofs of correctness possible, as in the case of the perceptron, but it also meant that the classifier was strictly limited in what it could learn. Nearest-neighbor was the first algorithm in history that could take advantage of unlimited amounts of data to learn arbitrarily complex concepts. No human being could hope to trace the frontiers it forms in hyperspace from millions of examples, but because of Cover and Hart’s proof, we know that they’re probably not far off the mark.
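The nearest-neighbor idea fits in a dozen lines of Python. A sketch on made-up 2-D points (my own illustration): the k closest training examples vote on the label of a new point.

```python
import math
from collections import Counter

# Toy training data: 2-D points with labels (hypothetical).
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

def knn_predict(point, k=3):
    # Sort by Euclidean distance and let the k nearest examples vote.
    nearest = sorted(train, key=lambda ex: math.dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2, 2)), knn_predict((6, 5)))
```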

Is the nearest neighbor algo the master algorithm? It isn’t, because of the curse of dimensionality. As the dimension of the covariates goes up, the NN algo’s efficiency goes down. In fact, the curse of dimensionality is the second most important stumbling block in machine learning, over-fitting being the first. There are certain techniques to handle the dimension explosion, but most of them are hacks and there is no guarantee that they are going to work.

Subsequently, the author introduces Support Vector Machines (SVMs), which have become the most popular technique used by Analogizers. I loved the way the author describes this technique using plain, simple English. He asks the reader to visualize a fat serpent that moves between two countries that are at war. The story of finding the serpent incorporates pretty much all the math that is needed to compute support vectors, i.e.

  • kernel for SVM
  • support vectors
  • weight of the support vectors
  • constrained optimization
  • maximizing the margin of the classifier

My guess is, one would understand the math far more easily after reading through this section on SVMs. SVMs have many advantages and the author highlights most of them. Books such as these also help us in verbalizing math stuff in simple words. For example, if you were to explain the difference between constrained optimization and unconstrained optimization to a taxi driver, how would you do it? Read this book to check whether your explanation is better than what the author provides.

Towards the end of the chapter, the author talks about case-based reasoning and says that in the years to come, analogical reasoning will become so powerful that it will sweep through all the fields where case-based reasoning is still employed.


Learning without a teacher

Unlike the previous chapters, which focused on labeled data, this chapter is about unsupervised learning. Cognitive scientists describe theories of child learning using algos, and machine learning researchers have developed techniques based on them. The author explains the k-means algorithm, a popular clustering technique. It is actually a special case of the Expectation Maximization (EM) algorithm, which was invented by three Harvard statisticians. EM is used in a ton of places. To learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. Once you have clusters at the macro level, nothing stops you from using the same algo within each cluster to come up with sub-clusters, and so on.
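A minimal k-means sketch on made-up 1-D data (my own illustration), showing the EM-like alternation: assign each point to its nearest center, then move each center to the mean of its assigned points.

```python
# Two obvious 1-D clusters (toy data) and two initial center guesses.
points = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]
centers = [0.0, 5.0]

for _ in range(10):
    # Assignment step (the "E-step"): each point goes to its nearest center.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update step (the "M-step"): each center moves to the mean of its points.
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]

print([round(c, 3) for c in centers])  # settles near 1.025 and 9.025
```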

Subsequently, the author introduces another popular technique for unsupervised learning, PCA, which is used for dimensionality reduction. PCA tries to come up with linear combinations of the various dimensions in the hyperspace so that the total variance of the data across all dimensions is maximized. A step up from this algo is "Isomap", a nonlinear dimensionality reduction technique. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network, and finds the reduced coordinates that best approximate these distances.
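A sketch of the PCA idea on toy 2-D data (my own illustration; real implementations use a linear algebra library). Power iteration repeatedly applies the covariance matrix to a vector, which rotates it toward the direction of maximum variance, the first principal component:

```python
import math

# Toy 2-D data, roughly along the diagonal (hypothetical).
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

# Center the data, then build the 2x2 covariance matrix.
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: the vector converges to the dominant eigenvector.
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

print(round(v[0], 3), round(v[1], 3))  # a unit vector pointing along the diagonal
```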

After introducing clustering and dimensional reduction techniques, the author talks about "Reinforcement learning", a technique that relies on immediate response of the environment for various actions of the learner. Research on reinforcement learning started in earnest in the early 1980s, with the work of Rich Sutton and Andy Barto at the University of Massachusetts. They felt that learning depends crucially on interacting with the environment, but supervised algorithms didn’t capture this, and they found inspiration instead in the psychology of animal learning. Sutton went on to become the leading proponent of reinforcement learning. Another key step happened in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children’s learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup that combines neural networks and reinforcement learning is "DeepMind", a company that was acquired by Google for half a billion dollars.
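A toy sketch of that modern formulation: a tabular Q-learning agent in a five-state corridor with a reward at the far end (my own construction, not an example from the book). The learner improves purely from the environment’s immediate responses:

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.2      # learning rate, discount, exploration

for _ in range(500):
    s = 0
    while s != GOAL:
        # Mostly greedy, occasionally explore a random action.
        a = random.randrange(2) if random.random() < epsilon else q[s].index(max(q[s]))
        s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: immediate reward from the environment plus the
        # discounted value of the best action in the next state.
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
        s = s2

# The learned policy walks right toward the goal in every state.
policy = [q[s].index(max(q[s])) for s in range(GOAL)]
print(policy)
```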

Another algorithm that has the potential to be part of the "Master Algorithm" is chunking. Chunking remains a preeminent example of a learning algorithm inspired by psychology. The author gives a basic outline of this concept. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction, but a simpler type of learning by interacting with the environment is: A/B testing. The chapter ends with the author explaining another potentially killer algo, "relational learning".


The Pieces of the Puzzle Fall into Place

Progress in science comes from unifying theories: two or more seemingly disparate observations turn out to be driven by the same logic or law. If one looks at ML, the Master Algorithm would be akin to a unifying theory in science. It would unify each tribe’s master algorithm and techniques, and give one cohesive way to learn from data.

In fact, there is already a set of techniques called "meta learning" that some of the tribes use within their methods. For example, bagging, random forests and boosting are some of the famous meta learning techniques used by Symbolists. Bayesians have something called "model averaging", which treats each model as a hypothesis and weights its vote by its score. Meta learning in its current avatar is remarkably successful, but it’s not a very deep way to combine models. It’s also expensive, requiring as it does many runs of learning, and the combined models can be quite opaque.

The author uses the following schematic diagram for each of the tribes while explaining the rationale of a possible "Master Algorithm".


He then takes the reader through a tour of each tribe’s philosophy and master algorithm, and comes up with a unifier, called "Alchemy", which he calls the "Master Algorithm". In the process of creating this master algorithm, he introduces Markov Logic Networks and says that they serve as the representation of the problem. Alchemy uses posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. The author is wary about Alchemy’s immediate application and says that a ton of research is yet to be done before it can become a true Master Algorithm, i.e., one that has the capability to solve hard problems.

This chapter is a little more involved as it tries to connect all the ideas from the previous eight chapters and introduces a way to combine the various pieces of the puzzle to create a "Master Algorithm". The chapter will also be very interesting for an aspiring ML researcher who is trying to pick his focus area.


This is the World on Machine Learning

The last chapter of the book discusses the world in which "Master Algorithm" is all pervasive. The author tries to speculate answers to the following questions :

  • Will humans be replaced by machines ?
  • What do you want the learners of the world to know about you ?
  • How good a model of you can a learner have ?
  • How will ecommerce shopping experience be ?
  • Will there be a rise of "humanities" disciplines after the automation of most non-human-related tasks ?
  • What will the current online dating sites morph into ?
  • How will the Amazons, Netflixes, Googles of the world change ?
  • What will be the privacy issues in a society where most of the transactions and activities involve, one algo talking to another algo?
  • Will future wars be fought by robots ?
  • Will robot-warfare be viable ?
  • Will AI and Master Algorithm take over the world ?

The author ends the book by saying,

Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change.

Takeaway :

This book is a pop-science book for Machine Learning. ML has reached a point where it is not just for geeks anymore. Everyone needs to know about it; everyone needs to at least have a conceptual model of the field, as it has become all pervasive. Having said that, it would take time to plod through the book if you are a complete newbie to ML. This book is massively appealing to someone who has a cursory knowledge of a few ML techniques and wants a 10,000 ft. view of the entire field’s past, present and future. The future, as the title of the book states, would be the invention of a master algorithm that unifies methods across all the five tribes.



If one rips apart a computer and looks at its innards, one sees a coalescence of beautiful ideas. The modern computer is built with electronics, but the ideas behind its design have nothing to do with electronics. A basic design can be built from valves, water pipes, etc. The principles are the essence of what makes computers compute.

This book introduces some of the main ideas of computer science such as Boolean logic, finite-state machines, programming languages, compilers and interpreters, Turing universality, information theory, algorithms and algorithmic complexity, heuristics, uncomputable functions, parallel computing, quantum computing, neural networks, machine learning, and self-organizing systems.

Who might be the target audience of the book ? I guess the book might appeal to someone who has a passing familiarity with the main ideas of computer science and wants to know a bit more, without being overwhelmed. If you ever wanted to explain the key ideas of computing to a numerate high school kid in such a way that it is illuminating and at the same time makes him/her curious, and you are struggling to explain the ideas in simple words, then you might enjoy this book.

Nuts and Bolts

Recollect any algebraic equation you would have come across in high school. Now replace the variables with logic statements that are either true or false, and the arithmetic operations with the relevant Boolean operators, and you get a Boolean algebraic equation. This sort of algebra was invented by George Boole. It was Claude Shannon who showed that one could build electronic circuits that mirror any Boolean expression. The implication of this construction is that any function capable of being described as a precise logical statement can be implemented by an analogous system of switches.
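Shannon’s correspondence between logical statements and switches is easy to mimic in software. A toy sketch (my own hypothetical alarm example, not one from the book): any precise logical statement can be composed from AND, OR and NOT, and hence from switches.

```python
# Boolean operators as building blocks, standing in for physical switches.
def AND(a, b): return a and b
def OR(a, b): return a or b
def NOT(a): return not a

# A hypothetical precise logical statement: "the alarm rings if the door is
# open AND the system is armed, OR the test button is pressed."
def alarm(door_open, armed, test_button):
    return OR(AND(door_open, armed), test_button)

for inputs in [(True, True, False), (True, False, False), (False, False, True)]:
    print(inputs, alarm(*inputs))
```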

Any type of computer involves two principles: 1) reducing a task to a set of logical functions and 2) implementing logical functions as a circuit of connected switches. In order to illustrate these two principles, the author narrates his experience of building a tic-tac-toe game with a set of switches and bulbs. If you were to manually enumerate all the possible combinations of a two-player game, one of the best representations of all the moves is a decision tree. This decision tree will help one choose the right response to whatever one’s opponent’s move is. Traversing the decision tree means evaluating a set of Boolean expressions. Using switches in series or parallel, it is possible to create an automated response to a player’s moves. The author briefly describes the circuit he built using 150-odd switches. So, all one needs to construct an automated tic-tac-toe game is switches, wires, bulbs and a way to construct logical gates. There are two crucial elements missing from this design. The first is that the circuit has no concept of events happening over time; therefore the entire sequence of the game must be determined in advance. The second is that the circuit can perform only one function. There is no software in it.

Early computers were made with mechanical components. One can represent AND, OR, NOR, etc. using a set of mechanical contraptions that can then be used to perform calculations. Even in the 1960s, most arithmetic calculators were mechanical.

Building a computer out of any technology requires two components, i.e.

  • switches : the steering elements, which can combine multiple signals into one signal
  • connectors : carry signals between switches

The author uses hydraulic valves, Tinker Toys, mechanical contraptions and electrical circuits to show the various ways in which Boolean logic can be implemented. The basic takeaway from this chapter is the principle of functional abstraction. This process of functional abstraction is fundamental in computer design-not the only way to design complicated systems but the most common way. Computers are built up on a hierarchy of such functional abstractions, each one embodied in a building block. Once someone implements 0/1 logic, you build stuff over it. The blocks that perform functions are hooked together to implement more complex functions, and these collections of blocks in turn become the new building blocks for the next level.

Universal Building blocks

The author explains the basic mechanism behind translating any logical function into a circuit. One needs to break down the logical function into parts and figure out the type of gates required to translate the various inputs into the appropriate output of the logical function. To make this somewhat abstract principle more concrete, the author comes up with a circuit design for a "Majority wins" logical block. He also explains how one could design a circuit for the "Rock-Paper-Scissors" game. The learnings from these simple circuits are then extrapolated to the general computer. Consider an operation like addition or multiplication; you can always represent it as a Boolean logic block. Most computers have logical blocks called arithmetic units that perform this job.

Hence the core idea behind making a computer perform subtraction, addition, multiplication, or whatever computation is to write out the Boolean logic on a piece of paper and then implement the same using logical gates.

The most important class of functions are time-varying functions, i.e. those whose output depends on the previous history of inputs. These are handled via a finite-state machine. The author describes the basic idea of a finite-state machine via examples such as a ballpoint pen, a combination lock, a tally counter, an odometer etc. To store the state of a finite-state machine, one needs a device called a register. An n-bit register has n inputs and n outputs, plus an additional timing input that tells the register when to change state. Storing new information is called "writing" the state of the register. When the timing signal tells the register to write a new state, the register changes its state to match the inputs. The outputs of the register always indicate its current state. Registers can be implemented in many ways, one of which is to use a Boolean logic block to steer the state information around in a circle. This type of register is often used in electronic computers, which is why they lose track of what they’re doing if their power is interrupted.

A finite-state machine consists of a Boolean logic block connected to a register. The finite-state machine advances its state by writing the output of the Boolean logic block into the register; the logic block then computes the next state, based on the input and the current state. This next state is then written into the register on the next cycle. The process repeats in every cycle. The machine on which I am writing this blog post is a 1.7 GHz machine, i.e. the machine can change its state 1.7 billion times per second. To explain the details of a finite-state machine and its circuit implementation, the author uses familiar examples like traffic lights and the combination lock. These examples are superbly explained, and that’s the beauty of this book: simple examples are chosen to illustrate profound ideas.
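The register-plus-logic-block loop can be mimicked in a few lines of Python. A sketch of the traffic-light example (my own minimal version): a variable plays the register, a transition table plays the Boolean logic block, and each loop iteration plays the timing signal.

```python
# The "logic block": given the current state, compute the next one.
transitions = {"green": "yellow", "yellow": "red", "red": "green"}

state = "green"      # the "register" holding the current state
history = [state]
for _ in range(5):   # each cycle, the timing signal writes the next state
    state = transitions[state]
    history.append(state)

print(history)
```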

Finite state machines are powerful but limited. They cannot recognize many patterns that are common in our world. For instance, it is impossible to build a finite-state machine that will unlock a lock whenever you enter any palindrome. So, we need something else besides logic gates and finite state machines.


Boolean logic and the finite-state machine are the building blocks of computer hardware. The programming language is the building block of computer software. There are many programming languages, and if you learn one or two, you can quickly pick up the others. Having said that, writing code to perform something is one thing and writing effective code is a completely different thing. The latter takes many hours of deliberate effort. This is aptly put by the author,

Every computer language has its Shakespeares, and it is a joy to read their code. A well-written computer program possesses style, finesse, even humor-and a clarity that rivals the best prose.

There are many programming languages and each has its own specific syntax. The syntax in these languages is convenient to write, as compared to machine language instructions. Once you have written a program, how does the machine know what to do ? There are three main steps :

  1. a finite-state machine can be extended, by adding a storage device called a memory, which will allow the machine to store the definitions of what it’s asked to do
  2. extended machine can follow instructions written in machine language, a simple language that specifies the machine’s operation
  3. machine language can instruct the machine to interpret the programming language

A computer is just a special type of finite-state machine connected to a memory. The computer’s memory-in effect, an array of cubbyholes for storing data-is built of registers, like the registers that hold the states of finite-state machines. Each register holds a pattern of bits called a word, which can be read (or written) by the finite-state machine. The number of bits in a word varies from computer to computer. Each register in the memory has a different address, so registers are referred to as locations in memory. The memory contains Boolean logic blocks, which decode the address and select the location for reading or writing. If data is to be written at this memory location, these logic blocks store the new data into the addressed register. If the register is to be read, the logic blocks steer the data from the addressed register to the memory’s output, which is connected to the input of the finite-state machine. Memory can contain data, processing instructions and control instructions. These instructions are stored in machine language. Here is the basic hierarchy of functional dependence over the various components:

  1. Whatever we need the computer to perform, we write in a programming language
  2. This is converted into machine language by a compiler, via a predetermined set of subroutines called the operating system
  3. The instructions are stored in memory and categorized into control and processing instructions
  4. Finite-state machines fetch and execute the instructions
  5. The instructions, as well as the data, are represented by bits and stored in the memory
  6. Finite-state machines and memory are built from storage registers and Boolean blocks
  7. Boolean blocks are implemented via switches in series or parallel
  8. Switches control something physical that sends a 0 or a 1

If you look at these steps, each idea is built on a level of abstraction. In one sense, that's why anyone can write simple software programs without understanding many details of the computer architecture. However, the closer one gets to the machine language, the more one needs to understand the details of the various abstractions that have been implemented.
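The bottom two levels of the hierarchy, Boolean blocks built from switches in series or parallel, can be made concrete with a small sketch. The half-adder at the end is my own illustrative addition, not an example from the book:

```python
def series(a, b):    # two switches in series pass current only if both are closed: AND
    return a and b

def parallel(a, b):  # two switches in parallel pass current if either is closed: OR
    return a or b

def invert(a):       # a normally-closed switch: NOT
    return not a

def half_adder(a, b):
    """Add two bits using only the switch-built blocks above."""
    total = series(parallel(a, b), invert(series(a, b)))  # XOR: the sum bit
    carry = series(a, b)                                  # AND: the carry bit
    return total, carry

print(half_adder(True, True))   # → (False, True): 1 + 1 = binary 10
```

Chaining such blocks is exactly how the arithmetic circuits inside the finite-state machine are built up from physical switches.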

How universal are Turing machines?

The author discusses the idea of a universal computer, first described in 1937 by Alan Turing. What is a Turing machine?

Imagine a mathematician performing calculations on a scroll of paper. Imagine further that the scroll is infinitely long, so that we don’t need to worry about running out of places to write things down. The mathematician will be able to solve any solvable computational problem no matter how many operations are involved, although it may take him an inordinate amount of time.

Turing showed that any calculation that can be performed by a smart mathematician can also be performed by a stupid but meticulous clerk who follows a simple set of rules for reading and writing the information on the scroll. In fact, he showed that the human clerk can be replaced by a finite-state machine. The finite-state machine looks at only one symbol on the scroll at a time, so the scroll is best thought of as a narrow paper tape, with a single symbol on each line. Today, we call the combination of a finite-state machine with an infinitely long tape a Turing machine.
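A minimal simulator makes the picture concrete. This toy example is my own (the rule-table format and the bit-flipping machine are illustrative, not from the book); the dictionary plays the role of the infinite tape, and the rules are the finite-state machine:

```python
def run_turing(rules, tape, state, max_steps=1000):
    """rules maps (state, symbol) -> (new_state, symbol_to_write, head_move)."""
    tape = dict(enumerate(tape))        # sparse tape: the "infinite scroll"
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, "_")    # "_" stands for a blank square
        state, write, move = rules[(state, symbol)]
        tape[head] = write
        head += move
    return "".join(tape[i] for i in sorted(tape))

# A machine that flips every bit and halts at the first blank
rules = {
    ("scan", "0"): ("scan", "1", +1),
    ("scan", "1"): ("scan", "0", +1),
    ("scan", "_"): ("halt", "_", 0),
}
print(run_turing(rules, "1011", "scan"))  # → 0100_
```

The striking point is that `run_turing` itself never changes: feed it a different rule table and it becomes a different machine, which is the essence of universality.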

The author also gives a few examples of noncomputable problems, such as the halting problem. He also touches briefly upon quantum computing and explains the core idea using the water molecule.

When two hydrogen atoms bind to an oxygen atom to form a water molecule, these atoms somehow "compute" that the angle between the two bonds should be 107 degrees. It is possible to approximately calculate this angle from quantum mechanical principles using a digital computer, but it takes a long time, and the more accurate the calculation the longer it takes. Yet every molecule in a glass of water is able to perform this calculation almost instantly. How can a single molecule be so much faster than a digital computer?

The reason it takes the computer so long to calculate this quantum mechanical problem is that the computer would have to take into account an infinite number of possible configurations of the water molecule to produce an exact answer. The calculation must allow for the fact that the atoms comprising the molecule can be in all configurations at once. This is why the computer can only approximate the answer in a finite amount of time.

One way of explaining how the water molecule can make the same calculation is to imagine it trying out every possible configuration simultaneously, in other words, using parallel processing. Could we harness this simultaneous computing capability of quantum mechanical objects to produce a more powerful computer? Nobody knows for sure.

Algorithms and Heuristics

The author uses simple examples to discuss various aspects of algorithms, such as designing an algorithm and computing its running time. For many problems where precise algorithms are not available, the next best option is to use heuristics. Designing heuristics is akin to an art. Most real-life problems require a healthy mix of algorithmic and heuristic solutions. IBM's Deep Blue is an amazing example of how intermixing algorithms and heuristics can beat one of the best human minds. Still, a few cognitive tasks will remain out of the computer's reach, and humans will have to focus on skills that are inherently human. Geoff Colvin gives this topic a book-length treatment in Humans Are Underrated.

Memory: Information and Secret Codes

Computers do not have infinite memory, so there needs to be a way to measure the information stored in memory. An n-bit memory can store n bits of data, but we need to know how many bits are required to store a given input. One can think of various ways of doing so. One could use a "representational" definition, i.e., think about how each character of the input is represented on a computer and then tally the total number of bits required. Let's say this blog post has 25,000 characters and each character takes 8 bits on my computer; then the post takes up 0.2 million bits. But 8 bits per character could be a stretch; maybe the characters in this post need only 6 bits each, which brings the total down to 0.15 million bits. The problem with this way of quantifying information is that it is representation dependent. The ideal measure would be the minimum number of bits needed to represent the information. Hence the key question here is, "How much can you compress a given text without losing information?"

Let's say one wants to store the text of "War and Peace". Multiplying 8 bits by the number of characters in the novel gives an upper bound on the information size. By applying various forms of compression, a 6-bit encoding, taking advantage of regularities in the data, exploiting the grammar of the language, one can reduce the information size of the novel. In the end, compression using the best available statistical methods would probably reach an average representation size of fewer than 2 bits per character, about 25 percent of the standard 8-bit character representation.
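The bits-per-character arithmetic is easy to check with a general-purpose compressor. This sketch uses Python's zlib on a short repeated sample (the repetition exaggerates the regularity a real novel offers, so the ratio here lands far below the roughly 2 bits per character quoted above):

```python
import zlib

# A stand-in for a long stretch of the novel; repetition makes it very regular.
text = ("Well, Prince, so Genoa and Lucca are now just "
        "family estates of the Buonapartes. " * 200)

raw_bits = len(text) * 8                                    # the 8-bits-per-character upper bound
compressed_bits = len(zlib.compress(text.encode("ascii"), 9)) * 8

print(round(compressed_bits / len(text), 3))                # average bits per character
```

On genuinely varied prose the compressor does far worse than on this sample, which is precisely the point: the achievable ratio is a measurement of how much regularity the text contains.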

If the minimum number of bits required to represent an image is taken as a measure of the amount of information in the image, then an image that is easy to compress will have less information. A picture of a face, for example, will have less information than a picture of a pile of pebbles on the beach, because adjacent pixels in the facial image are more likely to be similar. The pebbles require more information to be communicated and stored, even though a human observer might find the picture of the face much more informative. By this measure, the picture containing the most information would be a picture of completely random pixels, like the static on a damaged television set. If the dots in the image have no correlation with their neighbors, there is no regularity to compress. So, pictures that are totally random require many bits and hence contain a lot of information. This goes against our intuitive notion of information: in common parlance, a picture of random dots should have less information than a picture with a specific pattern. Hence it is important for computers to store meaningful information; indeed, that is how many image and sound compression algorithms work: they discard meaningless information. A further generalization of this idea is to consider a program that can generate the data being stored. This leads us to another measure of information:

The amount of information in a pattern of bits is equal to the length of the smallest computer program capable of generating those bits.

This definition of information holds whether the pattern of bits ultimately represents a picture, a sound, a text, a number, or anything else.
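The contrast between regular and random bits shows up immediately if you compress both. In this sketch zlib is a practical stand-in for the "smallest program" measure (which is uncomputable in general); the sizes it reports track the idea closely:

```python
import os
import zlib

structured = bytes(range(256)) * 100   # highly regular: a short program generates it
noise = os.urandom(25600)              # no regularity for the compressor to exploit

print(len(structured), "->", len(zlib.compress(structured, 9)))  # shrinks dramatically
print(len(noise), "->", len(zlib.compress(noise, 9)))            # barely shrinks, if at all
```

The regular sequence collapses to a tiny description, while the random bytes come out roughly as long as they went in, mirroring the claim that random data carries the most information by this measure.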

The second part of this chapter talks about public-key/private-key encryption and error-correction mechanisms. The author strips away all the math and explains them in plain, simple English.

Speed: Parallel Computers

"Parallel computing" is a term tossed around in many places. What basic problem with a normal computer does parallel computing aim to solve? In the basic design of a computer, processing and memory have always been treated as two separate components, with all the effort directed toward increasing processing speed. If you compare a modern silicon-chip computer to, say, a vintage room-filling computer, the basic two-part design has remained the same: a processor connected to a memory. This has come to be known as the sequential computer, and its limits have created the need for parallel computing. To work any faster, today's computers need to do more than one operation at once. We can accomplish this by breaking up the computer's memory into lots of little memories and giving each its own processor. Such a machine is called a parallel computer. Parallel computers are practical because of the low cost and small size of microprocessors; we can build one by hooking together dozens, hundreds, or even thousands of these smaller processors. The fastest computers in the world are massively parallel computers, which use thousands or even tens of thousands of processors.

In the early days of parallel computing, many people were skeptical about it for a couple of reasons, the main one being Amdahl's law: no matter how many parallel processors you use, any part of the task that must be done sequentially limits the overall speedup. If, say, 10% of a task is inherently sequential, the completion time cannot drop below 10% of the original, so adding processors yields rapidly diminishing returns. Soon it was realized that most tasks had a negligible sequential component; with smart design, one could parallelize many tasks.
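Amdahl's law is a one-line formula: if a fraction s of the work is inherently serial, the speedup on n processors is 1 / (s + (1 - s)/n). A quick sketch shows why a 10% serial fraction caps the speedup at 10x:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Amdahl's law: overall speedup on n processors when a fraction of work is serial."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n_processors)

# With 10% serial work, the speedup approaches but never exceeds 1 / 0.10 = 10:
for n in (2, 10, 100, 10_000):
    print(n, round(amdahl_speedup(0.10, n), 2))
```

As the serial fraction shrinks toward zero, the cap rises toward n, which is why tasks with a negligible sequential component parallelize so well.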

Highly parallel computers are now fairly common. They are used mostly in very large numerical calculations (like weather simulation) or in large database calculations, such as extracting marketing data from credit card transactions. Since parallel computers are built of the same parts as personal computers, they are likely to become less expensive and more common with time. One of the most interesting parallel computers today is the one emerging almost by accident from the networking of sequential machines. The worldwide network of computers called the Internet is still used primarily as a communications system for people; the computers act mostly as a medium, storing and delivering information (like electronic mail) that is meaningful only to humans. Already, standards are beginning to emerge that allow these computers to exchange programs as well as data. The computers on the Internet, working together, have a potential computational capability that far surpasses any individual computer ever constructed.

Computers that learn and adapt

The author gives some basic ideas on which adaptive learning has been explored via algorithms. Using the examples of neural networks, perceptrons, etc., the author manages to explain the key ideas of machine learning.

Beyond Engineering

The brain cannot be analyzed via the usual "divide and conquer" approach used to understand the sequential computer. As long as the function of each part is carefully specified and implemented, and as long as the interactions between the parts are controlled and predictable, "divide and conquer" works very well; but an evolved object like the brain does not necessarily have this kind of hierarchical structure. The brain is much more complicated than a computer, yet it is much less prone to catastrophic failure. The contrast in reliability between the brain and the computer illustrates the difference between the products of evolution and those of engineering. A single error in a computer's program can cause it to crash, but the brain is usually able to tolerate bad ideas, incorrect information, and even malfunctioning components. Individual neurons in the brain are constantly dying and are never replaced; unless the damage is severe, the brain manages to adapt and compensate for these failures. Humans rarely crash.

So, how does one go about designing something differently from conventional engineering? The author illustrates this via a "sorting numbers" example, which proceeds as follows:

  • Generate a "population" of random programs
  • Test the population to find which programs are the most successful.
  • Assign a fitness score to each program based on how successfully they sort the numbers
  • Create new populations descended from the high-scoring programs. One can think of many ways here: only the fittest survive, or "breed" new programs by pairing survivors from the previous generation
  • When the new generation of programs is produced, it is again subjected to the same testing and selection procedure, so that once again the fittest programs survive and reproduce. A parallel computer will produce a new generation every few seconds, so the selection and variation processes can feasibly be repeated many thousands of times. With each generation, the average fitness of the population tends to increase; that is, the programs get better and better at sorting. After a few thousand generations, the programs will sort perfectly.

The author confesses that the output from the above experiment works, but that it is difficult to "understand" why it works.

One of the interesting things about the sorting programs that evolved in my experiment is that I do not understand how they work. I have carefully examined their instruction sequences, but I do not understand them: I have no simpler explanation of how the programs work than the instruction sequences themselves. It may be that the programs are not understandable, that there is no way to break the operation of the program into a hierarchy of understandable parts. If this is true, if evolution can produce something as simple as a sorting program which is fundamentally incomprehensible, it does not bode well for our prospects of ever understanding the human brain.

The chapter ends with a discussion about building a thinking machine.

Takeaway:

The book explains the main ideas of computer science in simple words. Subsequently, it discusses many interesting areas currently being explored by researchers and practitioners. Anyone who is curious about questions like "How do computers work?" or "What are the limitations of today's computer hardware and software?" and wants some idea of the answers, without getting overwhelmed by them, will find this book interesting.


For those of us born before 1985, it is likely that we have seen two worlds: one that wasn't dependent on the net, and another where our lives are dominated/controlled by the web and social media. The author says that, given this vantage point, we have a unique perspective on how things have changed. It is impossible to imagine a life without print, yet before Gutenberg's invention of the printing press around the 1450s, access to knowledge was primarily through oral tradition. Similarly, maybe a decade or two from now, our next generation will be hard-pressed to imagine a life without connectivity. There is a BIG difference between Gutenberg's revolution and the Internet: the pace. Even though the printing press was invented around the 1450s, it was not until the 19th century that enough people were literate for the written word to influence society. In contrast, we have seen both the offline and online worlds in less than one lifetime.

We are in a sense a straddle generation, with one foot in the digital pond and the other on the shore, experiencing a strange suffering as we acclimatize. In a single generation, we are the only people in history to experience this level of change. The author is not critical of technology; after all, technology is neither good nor bad, it is neutral. First, something about the word in the title: Absence. It is used as a catch-all term for any activity that does not involve the internet, mobiles, tablets, social media, etc. Given this context, the author structures the book into two parts. The first part explores certain aspects of our behavior that have changed dramatically, with consequences we can see all around us. The second part is part reflection, part experimentation by the author in remaining disconnected in this hyper-connected world. In this post, I will summarize a few sections of the book.

Kids these days

Increasingly, kids live in a world where the daydreaming silences in their lives are filled by social media notifications, and burning solitudes are extinguished by constant yapping on social networks/phones and playing video games. That said, what is the role of a parent? The author argues that we have a responsibility to provide enough offline time to children.

How often have you seen a teenager staring out a window, doing nothing but being silent? In all likelihood, the parents would think that something is wrong with their kid: he had a fight over something with a sibling, something in class upset him/her, someone taunted him, etc. But if the kid is texting on his mobile, talking on the phone, or playing a video game, the standard reaction of a parent is "Kids these days," and they leave it at that. Instead of actively creating an atmosphere where downtime becomes a recurring event, parents shove technology onto kids to escape the responsibility. How else can one justify a product like the iPotty?


One digital critic says, “It not only reinforces unhealthy overuse of digital media, it’s aimed at toddlers. We should NOT be giving them the message that you shouldn’t even take your eyes off a screen long enough to pee.”

Many research studies have concluded that teenagers are more at ease with technologies than with one another. The author argues that parents should be aware of subtle cues and create engineered absences for their kids, so that they develop empathy for others via real-world interactions rather than avatars in the digital world.

Montaigne once wrote, “We must reserve a back shop, all our own, entirely free, in which to establish our real liberty and our principal retreat and solitude.” But where will tomorrow’s children set up such a shop, when the world seems to conspire against the absentee soul?


The author mentions the Amanda Todd incident: a teenager posts a YouTube video about her online bully and commits suicide. Online bullying is a widespread phenomenon, but social technologies offer only weak solutions to the problem: "flag this as inappropriate" or "block the user." Though crowd-sourced moderation may look like a sensible solution, in practice it is not; the moderation teams at most social media firms cannot handle the volume of flagging requests. The author mentions big-data tools being developed that take in all the flagging-request streams and then decide the appropriate action. The reduction of our personal lives to mere data does run the risk of collapsing things into a Big Brother scenario, with algorithms scouring the Internet for "unfriendly" behavior and dishing out "correction" in one form or another. Do we want algorithms to abstract, monitor, and quantify us? Well, if online bullying can be reduced by digital tools, so be it, even though it smells like a digital band-aid for problems of the digital world. The author is, however, concerned with the "broadcast" culture that has suddenly been thrust upon us.

When we make our confessions online, we abandon the powerful workshop of the lone mind, where we puzzle through the mysteries of our own existence without reference to the demands of an often ruthless public.

Our ideas wilt when exposed to scrutiny too early—and that includes our ideas about ourselves. But we almost never remember that. I know that in my own life, and in the lives of my friends, it often seems natural, now, to reach for a broadcasting tool when anything momentous wells up. Would the experience not be real until you had shared it, confessed your "status"?

The idea that technology must always be a way of opening up the world to us, of making our lives richer and never poorer, is a catastrophic one. But the most insidious aspect of this trap is the way online technologies encourage confession while simultaneously alienating the confessor.


The author wonders about the "distance" that any new technology creates between a person and his/her direct experience. Maps made the information obtained by exploring a place less necessary, since an abstract version of it was more convenient. The mechanical clock regimented leisurely time and eventually had more control over you than your body's own inclinations. So it is with MOOCs, which take us away from directly experiencing a teacher's lesson in flesh and blood. These changes in our society are possibly irrevocable. The fact that the selfie stick made Time magazine's list of the 25 best inventions of 2014 says that some part of us wants to share the moment more than actually experience it. Daniel Kahneman, in one of his interviews, talks about the riddle of the experiencing self vs. the remembering self.

Suppose you go on a vacation and, at the end, you get an amnesia drug. Of course, all your photographs are also destroyed. Would you take the same trip again? Or would you choose one that’s less challenging? Some people say they wouldn’t even bother to go on the vacation. In other words, they prefer to forsake the pleasure, which, of course, would remain completely unaffected by its being erased afterwards. So they are clearly not doing it for the experience; they are doing it entirely for the memory of it.

Almost every person in today's world is dependent on digital technologies. Do they take us away from having an authentic, more direct experience? Suppose your cell phone and your internet access were taken away from you over the weekend: could you still lead a relaxed, refreshing weekend? If the very thought of such a temporary two-day absence gives you discomfort, then you should probably revisit the very idea of what it means to have an authentic experience. Can messaging a group on WhatsApp count as an authentic "being with a friend" experience? All our screen time, our digital indulgence, may well be wreaking havoc on our conception of the authentic. Paradoxically, it's the impulse to hold more of the world in our arms that leaves us holding more of reality at arm's length.

The author mentions the Carrington Event:

On September 1, 1859, a storm on the surface of our usually benevolent sun released an enormous megaflare, a particle stream that hurtled our way at four million miles per hour. The Carrington Event (named for Richard Carrington, who saw the flare first) cast green and copper curtains of aurora borealis as far south as Cuba. By one report, the aurorae lit up so brightly in the Rocky Mountains that miners were woken from their sleep and, at one a.m., believed it was morning. The effect would be gorgeous, to be sure. But this single whip from the sun had devastating effects on the planet’s fledgling electrical systems. Some telegraph stations burst into flame.

and says that such an event, according to experts, has a 12% probability of occurring in the next decade and a 95% probability in the next two centuries. What will happen when such an event occurs?

Breaking Away

This chapter narrates the author's initial efforts to seek the absence. In a way, the phrase "seeking the absence" is itself ironic: if we don't seek anything and simply stay still, aren't we in absence already? Not really, if our mind is in a hyperactive state.

One can think of many things that demand a significant time investment, or, to be precise, an uninterrupted time investment. In my life, there are a few such activities: reading a book, understanding a concept in math/stat, writing a program, playing an instrument. One of the first difficulties in pursuing these tasks is, "How does one go about managing distractions, be they digital or analog?" As for digital distractions, constantly checking email/WhatsApp/Twitter makes it tough to concentrate on a task that necessitates full immersion.

Why does our brain want to check emails/messages so often? What makes these tools addictive? It turns out the answer was given way back in 1937 by the psychologist B. F. Skinner, who described the behavior as "operant conditioning." Studies show that constant, reliable rewards do not produce the most dogged behavior; rather, it's sporadic and random rewards that keep us hooked. Animals, including humans, become obsessed with reward systems that only occasionally and randomly give up the goods. We continue the conditioned behavior long after the reward is taken away because surely, surely, the sugar cube is coming up next time. So, that one meaningful email once in a while keeps us hooked on the "frequent email checking" activity. Is trading in the financial markets in search of alpha an outcome of operant conditioning? The more I look at traders who keep trading despite poor performance, the more certain I feel it is. The occasional reward keeps them hooked on trading, despite subpar performance.

Try reading a book and notice how many times you start thinking, "What else could I be doing/reading now?" Have we lost the ability to remain attentive to a given book or task without constantly multi-tasking? (BTW, research has shown that there is no such thing as multitasking; all we do is mini-tasking.) It definitely happens to me quite a number of times. When I am going through something that is really tough to understand in an ebook (mostly, the books I end up reading these days are in ebook format, as hardbound editions are beyond my budget), I press ALT+TAB, the attention-killing combination on a keyboard that takes me from a situation where I have to actively focus to understand TO a Chrome/Firefox tab where I can passively consume content, indulge in hyperlink hopping, waste time, and really gain nothing. Over the years, I have figured out a few hacks that alert me to this compulsive ALT+TAB behavior. I cannot say I have slain the ALT+TAB dragon for good, but at least I have managed to control it.

The author narrates his experience of trying to read "War and Peace", a thousand-page book, amidst his hyper-connected world. He fails to get far in his initial attempts, as he finds himself indulging the automatic desires of the brain.

I’ve realized now that the subject of my distraction is far more likely to be something I need to look at than something I need to do. There have always been activities—dishes, gardening, sex, shopping—that derail whatever purpose we’ve assigned to ourselves on a given day. What’s different now is the addition of so much content that we passively consume.

Seeking help from Peter Bregman (of 18 Minutes fame), he allots himself 100 pages of "War and Peace" each day, with ONLY three email check-ins per day. He also explores the idea of using software to help control distractions (Sidney D'Mello, a Notre Dame professor, is creating software that tracks a person's attention in real time and sounds an alarm when it drifts). In the end, the thing that helps the author complete "War and Peace" is his awareness of the lack of absence, which pushes him to find periods of absence in which he can immerse himself. I like the way he describes this aspect:

As I wore a deeper groove into the cushions of my sofa, so the book I was holding wore a groove into my (equally soft) mind.

There’s a religious certainty required in order to devote yourself to one thing while cutting off the rest of the world—and that I am short on. So much of our work is an act of faith, in the end. We don’t know that the in-box is emergency-free, we don’t know that the work we’re doing is the work we ought to be doing. But we can’t move forward in a sane way without having some faith in the moment we’ve committed to. “You need to decide that things don’t matter as much as you might think they matter,”

Does real thinking require retreat? The author thinks so, and cites the example of John Milton, who took a decade off to read, read, and read, at a time when his peers were DOING and ACCOMPLISHING things. Did he waste his time? After this retreat, Milton wrote "Paradise Lost", a totemic feat of concentration. This example may be a little too extreme for a normal person to emulate, but I think we can actively seek mini-retreats in a day/week/month/year. By becoming oblivious to thoughts like "What are others doing?", "What else should I be doing right now?", and "What could that new notification on my mobile/desktop be about?", I guess we can manage to steal mini-retreats in our daily lives.


The word memory evokes, at least for many of us, a single stationary cabinet into which we file everything, and from which retrieval is at best partial (at least for the trivia that goes on in our lives). This popular notion has been thoroughly invalidated by many experiments in neuroscience. Brenda Milner's study of patient H.M. is considered a landmark event in neuroscience, as it established that our memory is not a single stationary cabinet into which we file everything: motor memory and declarative memory reside in different parts of the brain. Many subsequent experiments have established that human memory is a dynamic series of systems, with information constantly moving between them, and changing.

Why does the author talk about memory in this book? The main reason is that we rely more and more on Google, Wikipedia, and other digital tools for storing and looking up information; we have outsourced "transactive memory" to these services. In this context, the author mentions Timehop, a service that reminds you of what you were doing a year ago by aggregating content from your presence on Facebook, Twitter, and blogs. You might think it is a cool thing that Timehop keeps track of your life, but there is a subtle aspect to such services: we are tending to offload our memory to digital devices. Isn't it good that I don't have to remember all the trivia of life and at the same time have it at the click of a button? Why do we need memory at all, when whatever we need is a click away? There is no harm in relying on these tools. The issue, however, is that you cannot equate effective recall of information with "human memory." Memorization is the act of making something "a property of yourself," in both senses: the memorized content is owned by the memorizer and also becomes a component of that person's makeup. If you have memorized something, the next time you try to access the memory, you form a new memory of it; accessing a memory changes the memory. This is fundamentally different from "externalized memory." In a world where we increasingly find whatever we need online, "having a good memory" or memorization skills might seem useless. This chapter argues that they aren't.

How to absent oneself?

This chapter is about the author's experience of staying away from the digital world for one complete month, which he fondly calls "Analog August." He dutifully records his thoughts each day. At the end of the full month, he doesn't have an epiphany or anything to that effect, nor a breakthrough in his work. When he resumes his life after the month-long sabbatical, he realizes one thing: every hour of every day, we choose to allow devices/services/technologies into our lives. Prioritizing them and being aware of them is half the battle won; consciously stepping away from each of these connections on a daily basis is essential to escape their spell.


How does it feel to be the only people in history to know life both with and without the internet? Inevitably, the internet is rewiring our brains. It does not merely enrich our experience; it is becoming our experience. By thinking about the various aspects that are changing, we might be able to answer two questions: 1) What will we carry forward as a straddle generation? 2) What worthy things might we thoughtlessly leave behind as technology dissolves into the very atmosphere of our lives? Michael Harris has tried answering these questions and in the process has written a very interesting book. I thoroughly enjoyed reading it amidst complete silence. I guess every straddle-generation reader will relate to many aspects mentioned in the book.


In today’s world, where access to information is being democratized like never before, “learning how to learn” is a skill that will play a key role in one’s academic and professional accomplishments. This book collates ideas from some of the recent books on learning, such as “Make it Stick”, “A Mind for Numbers”, “The Five Elements of Effective Thinking” and “Mindset”. The author adds his own take on the various research findings and has come up with a 250-page book. Even if you have already absorbed the concepts in those earlier books, this one is still worth reading: any exercise that puts you in retrieval mode alters your memory of the concepts retrieved. Hence, even though the book serves as a content aggregator of the previous ones, reading the material through the eyes of a new author changes the way we store and retrieve the main principles behind effective learning.

Broaden the Margins

The book starts with the author narrating his own college experience, one in which standard learning techniques like “find a quiet place to study”, “practice something repeatedly to attain mastery”, “take up a project and do not rest until it is finished” were extremely ineffective. He chucks this advice and adopts an alternative mode of learning. Only later in his career as a science journalist, does he realize that some of the techniques he had adopted during his college days were actually rooted in solid empirical research. Researchers over the past few decades have uncovered techniques that remain largely unknown outside scientific circles. The interesting aspect of these techniques is that they run counter to the learning advice that we have all taken at some point in our lives. Many authors have written books/blog posts to popularize these techniques. The author carefully puts all the main learning techniques in a format that is easy to read, i.e. he strips away the academic jargon associated with the techniques. The introductory chapter gives a roadmap to the four parts of the book and preps the reader’s mind to look out for the various signposts in the “learning to learn” journey.

The Story maker

A brief look at the main players in our brain:


Labeled in the figure are three areas. The entorhinal cortex acts as a filter for incoming information, the hippocampus is where memory formation begins, and the neocortex is where conscious memories are stored. It was H.M., the legendary case study, who gave the medical research community its first glance into the workings of the brain. Doctors removed the hippocampus from H.M.’s brain, essentially removing his ability to form long-term memories. Many amazing aspects of the brain were revealed by experiments on H.M. One of them is that motor skills, such as playing music or driving a car, do not depend on the hippocampus. This meant that memories were not uniformly distributed and that the brain had specific areas handling different types of memory. H.M. retained some memories of his past after the removal of his hippocampus, which meant long-term memories were residing in some other part of the brain. The researchers then figured out that the only candidate left was the neocortex. The neocortex is the seat of human consciousness, an intricate quilt of tissue in which each patch has a specialized purpose.


To the extent that it’s possible to locate a memory in the brain, that’s where it resides: in neighborhoods along the neocortex primarily, not at any single address. This is as far as storage is concerned. How is retrieval done? Again a set of studies on epilepsy patients revealed that the left brain weaves the story based on the sensory information. The left hemisphere takes whatever information it gets and tells a tale to the conscious mind. Which part of the left brain tells this story? There is no conclusive evidence on this. The only thing known is that this interpreter module is present somewhere in the left hemisphere and it is vital to forming a memory in the first place. The science clearly establishes one thing: The brain does not store facts, ideas and experiences like a computer does, as a file that is clicked open, always displaying the identical image. It embeds them in a network of perceptions, facts and thoughts, slightly different combinations of which bubble up each time. No memory is completely lost but any retrieval of memory fundamentally alters it.

The Power of Forgetting

This chapter talks about Hermann Ebbinghaus and Philip Boswood Ballard, who were among the first to conduct experiments on memory storage and retrieval. Ebbinghaus tried to cram 2,300 nonsense syllables and measured how long it took to forget them.


The above is probably how we think of memory: our retention of anything falls as time goes by. Philip Boswood Ballard, on the other hand, was curious to see what could be done to improve learning. He tested his students at frequent intervals and found that testing itself increased their retention and made them better learners. These two experiments were followed by several others, and finally the Bjorks of UCLA shepherded the theory in a concrete direction, naming it “Forget to Learn”. Any memory has two kinds of strength: storage strength and retrieval strength. Storage strength builds up steadily and grows with study and use. Retrieval strength, on the other hand, is a measure of how quickly a nugget of information comes to mind; it increases with studying and use, but without reinforcement it drops off quickly, and its capacity is relatively small. The surprising thing about retrieval strength is this: the harder we work at retrieving something, the greater the subsequent spike in retrieval and storage strength. The Bjorks call this “desirable difficulty”. This leads to the key message of the chapter: forgetting is essential for learning.
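Ebbinghaus’s falling retention is often modeled as an exponential decay, R = e^(-t/S), where S is a stability parameter that plays the role of storage strength. The book itself gives no formula, so the model, the `retention` helper and the numbers below are purely illustrative:

```python
import math

def retention(t_days, stability):
    """Illustrative exponential forgetting curve: R = exp(-t / S).

    Higher `stability` (a stand-in for storage strength) means
    retrieval strength decays more slowly over time.
    """
    return math.exp(-t_days / stability)

# A weak memory (S = 2 days) vs. one reinforced by effortful
# retrieval practice (S = 10 days), the Bjorks' "desirable difficulty"
for t in (1, 7, 30):
    weak, strong = retention(t, 2), retention(t, 10)
    print(f"day {t:2d}: weak {weak:.2f}  reinforced {strong:.2f}")
```

The point of the toy model is just the shape: both curves start at 1.0 and fall, but the reinforced memory stays retrievable far longer.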

Breaking Good Habits

This chapter says that massed practice does not work as well as randomized practice. “Find a particular place to do your work, work on one thing until you master it, and only then proceed to the next” is the advice we often hear for effective learning. The chapter argues instead that varying the study environment and randomly mixing the topics you study produces better retrieval than the old school of thought.

Spacing out

This chapter says that spacing out your learning is better than massed practice. If you are learning anything new, it is always better to spread it out than to cram everything in one go. This is the standard advice: do not study all at once; study a bit daily. But how do we space out the studies? What is the optimal time to revisit something you have already read? Wait too long, and rereading feels like completely new material; wait too little, and your brain gets bored by the familiarity. The chapter narrates the story of Piotr Wozniak, who tackled the problem of how to space your studies and eventually created SuperMemo, digital flashcard software used by many people to learn foreign languages. Anki, an open-source program inspired by SuperMemo, is another very popular way to build spaced repetition into your learning schedule. The essence of this chapter is to distribute your study time over a longer interval in order to retrieve efficiently and ultimately learn better.
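Wozniak’s early SM-2 algorithm (the scheduler Anki’s is based on) turns the spacing idea into a few lines of arithmetic: each successful review multiplies the next interval by an “ease factor” that adapts to how hard the recall felt, and a failed recall restarts the schedule. A rough sketch, where the function name and the simplified bookkeeping are my own:

```python
def sm2_next(interval, ease, quality):
    """One SM-2-style update (simplified sketch).

    interval: days after which the card was due (0 = never reviewed)
    ease:     ease factor (SM-2 starts at 2.5, floored at 1.3)
    quality:  self-rated recall, 0 (blackout) .. 5 (perfect)
    Returns (next_interval_in_days, next_ease).
    """
    if quality < 3:
        return 1, ease            # failed recall: start the schedule over
    # Adapt ease: harder recalls shrink it, effortless ones grow it
    ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    if interval == 0:
        return 1, ease            # first successful review: see it tomorrow
    if interval == 1:
        return 6, ease            # second review: about a week out
    return round(interval * ease), ease

# A card recalled with quality 4 three reviews in a row:
# the gaps stretch out (1 day, then 6, then ~15)
i, e = 0, 2.5
for _ in range(3):
    i, e = sm2_next(i, e, 4)
print(i, round(e, 2))
```

The growing intervals are the whole trick: each review is scheduled just as retrieval strength is fading, which is exactly the “desirable difficulty” the previous chapter described.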

The Hidden Value of Ignorance

The chapter talks about the “fluency illusion”, the number one reason many students flunk exams: you study formulae, concepts and theories, and you are under the illusion that you know everything until the day you see the examination paper. One way out of this illusion is to test yourself often. The word “test” connotes different things to different people. For some teachers, it is a way to measure a student’s learning. For some students, it is something to crack in order to get through a course. The literature on testing has a completely different perspective: testing is a way of learning. When you take a test, you retrieve concepts from memory, and the very act of retrieving fundamentally alters the way you store those concepts. Testing oneself IS learning. The chapter cites a study done on students that shows the following results


The above results show that testing does not equal studying; in fact, testing beats studying, and by a country mile on delayed tests. Researchers have come up with a new term to ward off some of the negative connotations of the word “test”: they call it “retrieval practice”. This is actually the more appropriate term, since testing oneself (answering a quiz, reciting or writing from memory) is essentially a form of retrieval that shapes learning. When we successfully retrieve something from memory, we re-store it in a different way than before. Not only has the storage level spiked; the memory itself has new and different connections, now linked to other related things we also retrieved. Using our memory changes our memory in ways we don’t anticipate. One of the ideas the chapter delves into is administering a sample final exam right at the beginning of the semester. The student will flunk it, of course, but the very fact that he sees a set of questions and their pattern before anything is taught makes him a better learner by the end of the semester.

Quitting before you are ahead

The author talks about “percolation”, the process of quitting an activity after we have begun it and then revisiting it at frequent intervals. Many writers explicitly describe this process; you can read their autobiographies for the details. Most say something to this effect: “I start on a novel, then take a break and wander around a familiar or unfamiliar environment, and the characters tend to appear in the real or imaginary world, giving clues to continue the story.” This seems too domain-specific. Maybe it applies only to writing, where, after all, writing about something is discovering what you think about it, and that takes consciously quitting and revisiting your work.

The author cites enough stories to show that this kind of percolation effect can benefit many other tasks. There are three elements to percolation. The first is interruption. Whenever you begin a project, there will be times when your mind says, “Quit it now, I can’t take it anymore.” Plodding through that phase is what we have been told leads to success. This chapter suggests another strategy: quit, with the intention of coming back. There is always a fear that we will never get back to the work, but if it is something you truly care about, you will, at some point. When you quit and then want to return after a break, the second element kicks in: your mind becomes tuned to see and observe things related to your work everywhere. Eventually the third element comes into play: listening to all the incoming bits and pieces of information from the environment and revisiting the unfinished project. In essence, working this way means quitting frequently with the intention of returning, which tunes your mind to notice things you had never paid attention to. I have seen this percolation effect in my own learning so many times that I don’t need a raft of research to believe it works.

Being Mixed up

The author starts off by mentioning the famous beanbag-tossing experiment of 1978, which showed the benefits of interleaved practice. The study was buried by academicians because it went against the conventional wisdom of “practice till you master it”. Most psychologists who study learning fall into two categories: those who focus on motor/movement skills and those who focus on language/abstract skills. Studies have shown that we memorize motor skills and language skills in separate ways; motor memories can be formed without the hippocampus, unlike declarative memories. Only in the 1990s did researchers begin conducting experiments that tested both motor and declarative memories. After several studies, they found that interleaving has a strong effect on any kind of learning. The most surprising thing is that the participants felt massed practice was somehow better, despite test scores showing interleaving to be the better alternative. One can easily relate to this feeling. If you spend, say, a day on something and manage to understand a chapter of a book, you might be tempted to read the next chapter, and the next, until the difficulty reaches a point where you need far more effort to get through the concepts. Many of us are not willing to take a break and revisit the material a week or a month later. Why? Here are some reasons based on my experience:

  • I have put so much effort into understanding the material (say, the first 100 pages of a book). This new principle or theorem on the 101st page is tough. If I take a break and come back after a week or so, I might have to review all 100 pages again, which could be a waste of time. Why not keep going and put the extra effort into page 101 while the previous 100 pages are still in my working memory?
  • I might never get the time to revisit this paper/book again and my understanding will be shallow
  • Why give up when I seem to be cruising through the material? This might be a temporary showstopper that I can slog through.
  • By taking a break from the book, am I giving in to my lazy brain which does not want to work through the difficult part?
  • What is the point in reading something for a couple of hours, then reading something else for a couple of hours? I don’t have a good feeling that I have learnt something
  • I have put in so many hours in reading this paper/book. Why not put in some extra hours and read through the entire book?

The above thoughts, research says, are precisely the ones that hamper effective learning. Interleaving is unsettling, but it is very effective.

Importance of Sleep

We intuitively know that a good sleep or a quick nap restores our energy. But why do humans sleep at all? One might think that, since this is an activity we have been doing for millennia, neuroscientists, psychologists and researchers would have figured out the answer by now. No: there is no single agreed-upon scientific explanation. Two main theories have been put forth. The first is that sleep is essentially a time-management adaptation: humans could not hunt or track in the dark, there was not much to do, and the internal body clock evolved to sleep through those hours. The brown bat sleeps twenty hours a day and is awake for the four hours at dusk when it can hunt mosquitoes and moths. Many such examples give credence to the theory that we are awake when there’s hay to be made and asleep when there is none. The other theory is that sleep’s primary purpose is memory consolidation. If we take it for granted that, for whatever reason, evolution has made us crave sleep, what happens to the things we learn? Do they get consolidated during sleep? The author gives a crash course on the five stages of sleep.


The five stages of sleep are illustrated in the above figure. There are bursts of REM (rapid eye movement) in a typical eight-hour sleep period; one typically experiences four to five REM bursts a night, averaging about twenty minutes each. With its bursts of REM and intricate, alternating layers of wave patterns, the brain must be up to something during sleep. But what? Over the last two decades, massive evidence has accumulated that sleep improves retention and comprehension. Evidence also maps Stage 2 sleep to motor-skill consolidation and the REM phase to learning-skill consolidation. If you are a musician or artist preparing for tomorrow’s performance, it is better to practice late into the night and get up a little late, so that the Stage 2 phase of sleep is completed. If you are trying to learn something academic, it makes sense to sleep early, as the REM phase that helps you consolidate comes up early in the eight-hour sleep period. Similar research on napping has found it useful for learning consolidation. The brain is basically separating signal from noise.

The Foraging brain

If effective learning is such a basic prerequisite for survival in today’s world, why haven’t people figured out a way to do it efficiently? There is no simple answer. The author’s response is that our ideas about learning are at odds with the way our brain has been shaped over the millennia. Humans were foragers; hunting and tracking dominated human life for over a million years, and the brain adapted to absorb, at maximum efficiency, the most valuable cues and survival lessons. The human brain became a forager too: for information, for strategies, for clever ways to foil other species’ defenses. Our language, customs and schedules, however, have come to define how we think the brain should work: be organized, develop consistent routines, concentrate on work, focus on one skill. All of this sounds fine until we start applying it in our daily lives. Do these strategies actually make us effective learners?

We know intuitively that concentrating on something beyond a certain time is counterproductive, that massed practice does not lead to longer retention, and that it is difficult to stay organized amid so many distractions. Instead of adapting our learning schedules to the foraging brain, we have been trying to adapt our foraging brain (something that evolved over millennia) to our customs, schedules and notions about learning (something that emerged over a few thousand years). The author says this is the crux of the problem, and it is what has kept us from becoming effective learners. The foraging brain that once brought us back to our campsite is the same one we use to make sense of academic and motor domains. When we do not understand something, the first instinct is to give up. Yet this feeling of being lost is essential: it drives the foraging brain to look for patterns and to create new pathways to make sense of the material. This reinforces many of the aspects touched upon in this book:

  • If you do not forget and you are not lost, you do not learn.
  • If you do not space out learning, you do not get lost from time to time and hence you do not learn.
  • If you do not use different contexts/physical environments to learn, your brain has fewer cues to help you make sense of learning.
  • If you do not repeatedly test yourself, the brain doesn’t get feedback and its internal GPS becomes rusty.

It is high time to adapt our notions of learning to that of our foraging brain; else we will be forever trying to do something that our brains will resist.


There are some counterintuitive learning strategies mentioned in this book: changing the physical environment of your study, spaced repetition, testing as a learning strategy, interleaving, quitting and revisiting projects frequently, welcoming distractions into your study sessions, and so on. Most of these differ from the standard suggestions on “how to learn”. The book collates the evidence from the research literature and argues that these strategies are far more effective than what we have known before.


In a world where uncertainty is the norm, being curious is one way to hedge the volatility in our professional and personal lives. By developing and maintaining a state of curiosity in whatever we do, we stand a good chance of leading a productive life. The author of this book, Ian Leslie, is a journalist, so it should come as no surprise that the book essentially annotates a set of articles and books on curiosity. It is longer than a blog post or newspaper article, yet falls short of a well-researched book.

We all intuitively know that real learning comes from being curious. Does one need a book to learn that? Not really, provided you understand that curiosity is vulnerable to benign neglect, and that you truly know what feeds it and what starves it. Unless we are consciously aware of it, our mind tends to drift toward a comfortable status quo. The more clearly we can identify the factors that keep us in a “curious state”, the better we are at staying in one, or at least at making the effort to get there. This book gives visuals, metaphors and examples that convey what others have said, written and experienced about curiosity.

First, a few terms about curiosity itself. Broadly there are two kinds. The first is diversive curiosity, a restless desire for the new and the next. The other is epistemic curiosity, a more disciplined and effortful inquiry, the “keep traveling even when the road is bumpy” kind. The Googles, wikis and MOOCs of the world whet our diversive curiosity, but that alone is not enough. From time to time we need to immerse ourselves and gain a deeper understanding of the things around us. If we remain forever in the state of diversive curiosity, our capacity for the slow, difficult and frustrating process of gathering knowledge, i.e. epistemic curiosity, may deteriorate.

The author uses Isaiah Berlin’s metaphor of the hedgehog vs. the fox and says that we must be “foxhogs”. A foxhog combines the traits of a hedgehog (deep expertise in one thing) and a fox (awareness of what is happening in many other areas). Curious learners go deep and they go wide. Here is a nice visual that captures the traits of a foxhog:


They say a startup with two or three founders is ideal; I guess the reason is that at least the team as a whole satisfies the foxhog criterion. Wozniak was a hedgehog and Steve Jobs was a fox, and their combination catapulted Apple from a garage startup to what it is today. Alexander Arguelles (“I can speak 50 languages”) is another foxhog. Charles Darwin, Charlie Munger and Nate Silver are all foxhogs who developed T-shaped skill sets.

Tracing the history of curiosity, the author says that, whichever time period you analyze, there has always been a debate between diversive and epistemic curiosity. In today’s digital world, with the onslaught of social media and ever more attention-seeking tools, how does one draw a line between the two appetites? One consequence of knowledge being available at a mouse click is that it robs you of the “desirable difficulty” that is essential for learning. “Slow to learn and slow to forget” is becoming difficult as the Internet provides instant solutions, turning learning into an “easy to learn and easy to forget” activity. Google gives answers to any question you ask, but it won’t tell you what questions to ask. Widening access to information does not mean curiosity levels have increased. The ratchet effect is an example of this phenomenon.

Via Omniscience bias:

James Evans, a sociologist at the University of Chicago, assembled a database of 34 million scholarly articles published between 1945 and 2005. He analysed the citations included in the articles to see if patterns of research have changed as journals shifted from print to online. His working assumption was that he would find a more diverse set of citations, as scholars used the web to broaden the scope of their research. Instead, he found that as journals moved online, scholars actually cited fewer articles than they had before. A broadening of available information had led to “a narrowing of science and scholarship”. Explaining his finding, Evans noted that Google has a ratchet effect, making popular articles even more popular, thus quickly establishing and reinforcing a consensus about what’s important and what isn’t. Furthermore, the efficiency of hyperlinks means researchers bypass many of the “marginally related articles” print researchers would routinely stumble upon as they flipped the pages of a printed journal or book. Online research is faster and more predictable than library research, but precisely because of this it can have the effect of shrinking the scope of investigation.

The book makes a strong argument against the ideas propagated by people like Ken Robinson and Sugata Mitra, who claim that knowledge is obsolete, that self-directed learning is the only way to educate a child, that memorization should be banished from the syllabus, and so on. In the late nineteenth and twentieth centuries, a series of thinkers and educators founded “progressive” schools, whose core principle was that teachers must not get in the way of the child’s innate love of discovery. Are these positions based on evidence? The author cites a lot of empirical research and dispels each of the following myths:

  • Myth 1: Children don’t need teachers to instruct them
  • Myth 2: Facts kill creativity
  • Myth 3: Schools should teach thinking skills instead of knowledge

The last part of the chapter gives a few suggestions that help the reader stay in a “curious state”. Most of them are very obvious, but I guess the anecdotes and stories that go with the suggestions help one be more cognizant of them. Here are two examples from one of the sections:

International Boring Conference

The Boring Conference is a one-day celebration of the mundane, the ordinary, the obvious and the overlooked: subjects often considered trivial and pointless, but which, when examined more closely, reveal themselves to be deeply fascinating. How often do we pause and look at mundane stuff?

Georges Perec (Question your tea-spoons)

Perec urges the reader to pay attention not only to the extraordinary but to, as he terms it, the infraordinary, or what happens when nothing happens. We must learn to pay attention to “the daily,” “the habitual”: What we need to question is bricks, concrete, glass, our table manners, our utensils, our tools, the way we spend our time, our rhythms. To question that which seems to have ceased forever to astonish us. We live, true, we breathe, true; we walk, we open doors, we go down staircases, we sit at a table in order to eat, we lie down on a bed to go to sleep. How? Where? When? Why? Describe your street. Describe another street. Compare. Make an inventory of your pockets, of your bag. Ask yourself about the provenance, the use, what will become of each of the objects you take out. Question your tea-spoons.

The book ends with a quote by T.H. White:

“The best thing for being sad,” replied Merlin, beginning to puff and blow, “is to learn something. That’s the only thing that never fails. You may grow old and trembling in your anatomies, you may lie awake at night listening to the disorder of your veins, you may miss your only love, you may see the world about you devastated by evil lunatics, or know your honour trampled in the sewers of baser minds. There is only one thing for it then — to learn. Learn why the world wags and what wags it. That is the only thing which the mind can never exhaust, never alienate, never be tortured by, never fear or distrust, and never dream of regretting. Learning is the only thing for you. Look what a lot of things there are to learn.”


This book is mainly targeted at high-school and college kids who feel their learning efforts are not paying off, teachers on the lookout for effective instruction techniques, and parents who are concerned about their child’s academic results and want to do something about it.

The author of the book, Dr. Barbara Oakley, has an interesting background. She served in the US Army as a language translator before transitioning to academia; she is now a professor of engineering at Oakland University in Rochester, Michigan. In the book, she admits that she had to completely retool her mind: a person who had mostly done artsy kinds of work had to study the hard sciences to earn a PhD and do research. Needless to say, the transition was a frustrating experience. One of her research areas is neuroscience, where she explores effective human learning techniques. She claims that her book is essentially meant to demystify some of the common notions we all have about learning.

The book is written in a “personal journal” format, i.e. with images, anecdotes, stories, etc. It is basically a collection of findings scattered across academic papers, blogs and pop-science books. In that sense it does the job of an “aggregator”, much like a Google search, except that the results are supplemented with comments and visuals.

Some of the collated findings mentioned in the book are:

1) Focused vs. diffuse modes of thinking: Tons of books have already been written on this subject. The book provides a visual to remind the reader of the basic idea behind it.


In the game of pinball, a ball, which represents a thought, shoots up from the spring-loaded plunger and bounces randomly against rows of rubber bumpers. These two pinball machines represent focused (left) and diffuse (right) ways of thinking. The focused approach relates to intense concentration on a specific problem or concept. But while in focused mode, you sometimes find yourself concentrating intently on erroneous thoughts that sit in a different place in the brain from the “solution” thoughts you actually need to solve the problem. As an example, note the upper “thought” that your pinball first bounces around in, in the left-hand image. It is far away from, and completely unconnected to, the lower pattern of thought in the same brain. Part of the upper thought has an underlying broad path, because you have thought something similar before. The lower thought is new and lacks that underlying broad pattern. The diffuse approach on the right often involves a big-picture perspective. This thinking mode is useful when you are learning something new. The diffuse mode doesn’t allow you to focus tightly and intently on a specific problem, but it can bring you closer to where the solution lies, because the ball can travel much farther before running into another bumper.

2) Spaced repetition: This idea has led to a massive research area in the field of cognitive psychology. The book nails it with the following visual:


Learning well means allowing time to pass between focused learning sessions, so the neural patterns have time to solidify properly. It’s like allowing time for the mortar to dry when you are building a brick wall, as shown on the left. Trying to learn everything in a few cram sessions doesn’t allow time for neural structures to become consolidated in your long-term memory; the result is a jumbled pile of bricks like those on the right.

3) Limited short-term memory:
Experiments have shown that you can hold at most four items in your working memory. This means the key to making sense of stuff lies in effective storage and retrieval of concepts/ideas from your long-term memory, rather than trying to cram everything into working memory (which will vanish quickly anyway).

4) Chunking: From K. A. Ericsson (the academician behind the notion of “deliberate practice”) to Daniel Coyle (pop-science author), all have emphasized this aspect. Again, a visual summarizes the key idea:


When you are first chunking a concept, its pre-chunked parts take up all your working memory, as shown on the left. As you begin to chunk the concept, you will feel it connecting more easily and smoothly in your mind, as shown in the center. Once the concept is chunked, as shown at the right, it takes up only one slot in working memory. It simultaneously becomes one smooth strand that is easy to follow and use to make new connections. The rest of your working memory is left clear. That dangling strand of chunked material has, in some sense, increased the amount of information available to your working memory, as if the slot in working memory is a hyperlink that has been connected to a big webpage.

5) Pomodoro to prevent procrastination: Knowledge scattered around various blogs and talks is put in one place. The idea is that you do work in slots of (25 min work + 5 min break).
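The 25/5 cadence is simple enough to mechanize. Here is a minimal, hypothetical Python sketch (not from the book) of a Pomodoro loop; the `tick` parameter is injectable so the schedule can be dry-run without actually waiting:

```python
import time

def pomodoro(work_min=25, break_min=5, cycles=4, tick=time.sleep):
    """Run `cycles` rounds of (work, break) and return a log of the
    schedule. `tick` waits the given number of seconds; pass a no-op
    to dry-run the schedule instead of waiting."""
    log = []
    for i in range(1, cycles + 1):
        log.append(f"cycle {i}: work {work_min} min")
        tick(work_min * 60)          # focused work slot
        log.append(f"cycle {i}: break {break_min} min")
        tick(break_min * 60)         # short diffuse-mode break
    return log

# Dry run: inspect the schedule without actually waiting
print(pomodoro(cycles=2, tick=lambda seconds: None))
```

Calling `pomodoro()` with the default `tick=time.sleep` turns it into a real timer; the break slot is where the diffuse mode from point 1 gets its chance to run.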

6) {(Recall + Test > Reread), (Interleave + Spaced repetition > Massed practice)}
– These ideas resonate throughout the book “Make It Stick”. This book, though, summarizes those ideas and supplements them with visuals such as:


Solving problems in math and science is like playing a piece on the piano. The more you practice, the firmer, darker, and stronger your mental patterns become.


If you don’t make a point of repeating what you want to remember, your “metabolic vampires” can suck away the neural pattern related to that memory before it can strengthen and solidify.

7) Memory enhancement hacks:
Most of the ideas from “Moonwalking with Einstein” and other such memory-hack books are summarized for easy reading.

8) Reading/engaging with diverse material pays off: This has been a common trait amongst many people who do brilliant stuff. Pick any person who has accomplished something significant, and you will find they have varied interests.


Here you can see that the chunk—the rippling neural ribbon—on the left is very similar to the chunk on the right. This symbolizes the idea that once you grasp a chunk in one subject, it is much easier for you to grasp or create a similar chunk in another subject. The same underlying mathematics, for example, echoes throughout physics, chemistry, and engineering—and can sometimes also be seen in economics, business, and models of human behavior. This is why it can be easier for a physics or engineering major to earn a master’s in business administration than someone with a background in English or history. Metaphors and physical analogies also form chunks that can allow ideas even from very different areas to influence one another. This is why people who love math, science, and technology often also find surprising help from their activities or knowledge of sports, music, language, art, or literature.

9) Adequate sleep is essential for better learning: This is like turning the lights off on a theatre stage so that the artists can take a break, relax, and come back for their next act. Not turning off the mind and overworking can only lead us to an illusion of learning, when in fact all we are doing is showcasing listless actors on the stage (working memory).


Toxins in your brain get washed away by having an adequate amount of sleep every day.

The book can easily be read in an hour or two, as it is filled with lots of images, metaphors, anecdotes, and recurrent themes. The content of this book is also offered as a four-week course on Coursera.

Lady Luck favors the one who tries

– Barbara Oakley


In today’s world, parents are extremely observant about how their children are learning. Be it academics, music, sport, or any other field the child has developed a semblance of liking for, the parent gives and seeks all the guidance available to make the kid’s learning process effective. In the hyperconnected, instant-gratification world that we are all living in, kids left to their own devices become just that, in the literal sense: their lives are surrounded by a world of devices (cell phones, gaming consoles, iPods, iPads, etc.), and naturally they develop an affinity towards them. One doesn’t need academic research to infer that attention spans are going down across all age groups, more so in children. In such an environment, can parents or teachers be confident that children develop the thinking and meta-thinking (thinking about how they are thinking) skills to become effective learners?

There is a mad rush towards alternative education schools everywhere. Parents are under the notion that schools that focus on standardized testing and standardized learning might not be effective for their kid, whom they think is somehow “special” and different from everyone else. In what sense the kid is “special”, only the future will tell, but that doesn’t stop parents from thinking that education must somehow be customized to suit their kid’s learning style.

In my own family, I have seen my cousin’s kids being put through a school where there are no tests at all until 8th or 9th grade. The school advertises to the general public that its USP is small classroom sizes and NO TESTS. The admission process creates a massive frenzy and even gets cited in the local newspapers. Parents feel that this “NO TEST” environment will unleash creativity and turn their little ones into creative geniuses. Is it really true that an environment without tests fosters good learning? Why is there a universal backlash against “tests”? Why is everyone fixated on “learning styles”? What is wrong with the current educational system? How does one become an effective learner? These and many more questions are answered in this book. Here is an attempt to summarize it.

Learning Is Misunderstood

This chapter is a prelude to the book and lists the claims that the authors verify via field research in the various chapters. What are the claims made at the outset?

  • Learning is deeper and more durable when it’s effortful. Learning that’s easy is like writing in sand, here today and gone tomorrow.
  • We are poor judges of when we are learning well and when we’re not. When the going is harder and slower and it doesn’t feel productive, we are drawn to strategies that feel more fruitful, unaware that the gains from these strategies are often temporary.
  • Rereading text and massed practice of a skill or new knowledge are by far the preferred study strategies of learners of all stripes, but they’re also among the least productive. By massed practice we mean the single-minded, rapid-fire repetition of something you’re trying to burn into memory, the “practice-practice-practice” of conventional wisdom. Cramming for exams is an example. Rereading and massed practice give rise to feelings of fluency that are taken to be signs of mastery, but for true mastery or durability these strategies are largely a waste of time.
  • Retrieval practice—recalling facts or concepts or events from memory— is a more effective learning strategy than review by rereading. Periodic practice arrests forgetting, strengthens retrieval routes, and is essential for hanging onto the knowledge you want to gain.
  • When you space out practice at a task and get a little rusty between sessions, or you interleave the practice of two or more subjects, retrieval is harder and feels less productive, but the effort produces longer lasting learning and enables more versatile application of it in later settings.
  • Trying to solve a problem before being taught the solution leads to better learning, even when errors are made in the attempt.
  • People do have multiple forms of intelligence to bring to bear on learning, and you learn better when you “go wide,” drawing on all of your aptitudes and resourcefulness, than when you limit instruction or experience to the style you find most amenable.
  • When you’re adept at extracting the underlying principles or “rules” that differentiate types of problems, you’re more successful at picking the right solutions in unfamiliar situations. This skill is better acquired through interleaved and varied practice than massed practice.
  • In virtually all areas of learning, you build better mastery when you use testing as a tool to identify and bring up your areas of weakness.
  • Elaboration is the process of giving new material meaning by expressing it in your own words and connecting it with what you already know. The more you can explain about the way your new learning relates to your prior knowledge, the stronger your grasp of the new learning will be, and the more connections you create that will help you remember it later.
  • Rereading has three strikes against it. It is time consuming. It doesn’t result in durable memory. And it often involves a kind of unwitting self-deception, as growing familiarity with the text comes to feel like mastery of the content.
  • It makes sense to reread a text once if there’s been a meaningful lapse of time since the first reading, but doing multiple readings in close succession is a time-consuming study strategy that yields negligible benefits at the expense of much more effective strategies that take less time. Yet surveys of college students confirm what professors have long known: highlighting, underlining, and sustained poring over notes and texts are the most-used study strategies, by far.
  • Rising familiarity with a text and fluency in reading it can create an illusion of mastery. As any professor will attest, students work hard to capture the precise wording of phrases they hear in class lectures, laboring under the misapprehension that the essence of the subject lies in the syntax in which it’s described. Mastering the lecture or the text is not the same as mastering the ideas behind them. However, repeated reading provides the illusion of mastery of the underlying ideas. Don’t let yourself be fooled. The fact that you can repeat the phrases in a text or your lecture notes is no indication that you understand the significance of the precepts they describe, their application, or how they relate to what you already know about the subject.

All the above claims are verified by experiments carried out in various school settings and other unconventional places. The authors at the very beginning make it clear that the learning theories that have been handed down to us have been a result of theory, lore and intuition. But over the last forty years and more, cognitive psychologists have been working to build a body of evidence to clarify what works and to discover the strategies that get results.

To Learn, Retrieve

We all forget things. If it’s trivial stuff, it really doesn’t matter. But if key principles and concepts are forgotten, our learning is stunted and it becomes painfully obvious that we need to re-read them. To give a specific example, let’s say I am learning about jump modeling and there is an introductory section on Poisson processes. In the past I would have spent some time going over Poisson processes and understanding the math behind them. The key theorems are somewhere in my memory, but not all are at my beck and call. So, whenever I come across a concept that I have a tough time recalling, my usual strategy is to re-read the old section. Not an ideal strategy, says this chapter. I think most of us follow this strategy, where re-reading is the go-to choice for making things fresh. The chapter focuses on one key point: retrieval practice. This is a kind of practice where you make an effort to recall concepts from memory and reflect on them from time to time. It is not the same as rereading the text.

The authors make a strong case for “testing” as a means of retrieval practice; the retrieval effect is known in cognitive psychology as the “testing effect”. Through the results of various experiments, the authors suggest that testing immediately after a lecture, or testing yourself at spaced intervals, is far better than rereading at spaced intervals. Repeated retrieval ties the knot of memory. Retrieval must be spaced out rather than becoming mindless repetition; it should require cognitive effort. The authors back up these suggestions with field experiments showing that frequent testing of students with delayed feedback gave better performance than merely rereading or revisiting the material before midterms and end terms.

For an adult learner, how does this apply? I guess one must self-test, even though it is painful. It is actually better if it is painful, as that leads to greater effort at retrieval and hence better learning. How should the tests be designed? For programming there are many suggestions out there. For something like math, I think the best way is to read a theorem and try to give a proof in your own words, recalling whatever you learnt the previous time around. Merely reading through the proofs or concepts will not make the learning stick. Generating the proof from a few clues is called the “generation effect”. In the case of frameworks or sets of ideas, you can probably write an essay recalling all the aspects of the theory without rereading. It is easy to fall into the trap of “OK, I have forgotten, let me reread the material”. Instead, this chapter says that one must pause, take a self-test, then quiz yourself as you go over the material again, and then reflect on what you have relearnt. This is called “elaboration” in the literature: you elaborate the learning or practice session so that memory paths are strengthened. The other thing I have started following recently is to take a small 60-page booklet and note down whatever I find interesting for the day, in the form of statements, visuals, or just about anything that captures the learning. Obviously, as you go along, these 60-page diaries accumulate. Once in a while you can pick up a booklet and read the statements that you found interesting a month ago, six months ago, a year ago. This is a kind of retrieval practice where you are trying to learn better by testing yourself.


Mix up practice

The authors introduce the term “massed practice”: you keep practicing one aspect of skill development until you are good at it and then move on. This is clearly seen in textbooks, where each chapter is followed by a set of problems relevant only to that chapter, and the reader is asked to practice the exercises and then move on to the next chapter. This is the usual advice passed on to us: practice, practice until the skill is burned into memory. Faith in focused, repetitive practice is everywhere, and this is what the authors mean by “massed practice”. Practice that is spaced out, interleaved with other learning, and varied produces better mastery, longer retention, and more versatility. There is one price to pay, though: it requires more effort from the learner, and there are none of the quick positive affirmations that come with massed practice. Let’s say you have been immersed in doing Bayesian analysis for a few months, then you take a break and get back to it; there will be an inherent slowness in how you digest things, as those concepts are lying somewhere in your long-term memory and invoking them takes effort. But the authors say that this is a good thing. Even though learning feels slower, this is the way to go.

This phenomenon of “massed practice” is everywhere: summer camps, focused workshops, training seminars. Spacing out your practice feels less productive for the very reason that some forgetting has set in and you’ve got to work harder to recall concepts. It doesn’t feel like you’re on top of it. What you don’t sense in the moment is that the added effort is making the learning stronger.

Why does spaced practice work? Massed practice is good for short-term memory, but the sad part is that it does not lead to durable learning. For something to get into long-term memory, there must be consolidation, in which memory traces are strengthened. If you do not activate these memory traces, the paths will be lost. It is like laying a new road, using it for a week or so, and then moving on: unless you use the road often, it never gets a chance to become strong, and the learning loses vitality.

Practice also has to be interleaved. Interleaving is practicing two or more subjects, or two different aspects of the same subject. You cannot study one aspect of a subject completely, move on to another, and so on; linearity isn’t good. Let’s say you are learning some technique, for example the EM algorithm. If you stick to the data mining field, you will see its application in, say, mixture estimation. However, by interleaving your practice with, say, state space models, you see the EM algorithm being used to estimate hyperparameters of a model. This interleaving of various topics gives a richer understanding. Obviously, there is a price to pay: the learner is just beginning to understand something when he is asked to move to another topic, so the feeling that he hasn’t got a full grasp of the topic remains. It is a good thing to have, but an unpleasant situation that a learner must handle.
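For readers who haven’t met the EM example above, here is a deliberately minimal, illustrative sketch (not from either book) of EM fitting a two-component 1-D Gaussian mixture. To keep it short, both component variances are fixed at 1; the function name and initialization are my own assumptions.

```python
import math

def em_gaussian_mixture(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture with both
    variances fixed at 1.0 for brevity. Returns (pi, mu1, mu2)."""
    pi, mu1, mu2 = 0.5, min(xs), max(xs)      # crude initialization
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            p1 = pi * math.exp(-0.5 * (x - mu1) ** 2)
            p2 = (1 - pi) * math.exp(-0.5 * (x - mu2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate mixing weight and the two means
        n1 = sum(r)
        pi = n1 / len(xs)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n1)
    return pi, mu1, mu2

# Two well-separated clusters, around 0 and around 10
data = [-0.5, 0.0, 0.3, 9.7, 10.0, 10.4]
pi, mu1, mu2 = em_gaussian_mixture(data)
print(round(pi, 2), round(mu1, 2), round(mu2, 2))
```

The same E-step/M-step skeleton reappears when EM is used to estimate hyperparameters of a state space model, which is exactly the kind of cross-topic connection interleaved practice surfaces.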

Varied practice – Let’s say you are a quant building financial models. Varied practice in your case would be to build a classification model, a Bayesian inference model, a Brownian-motion-based model, a more generic Levy-process-based model, a graph-based model, and so on. The point is that you develop a broader understanding of the relationships between various aspects of model building. If you stick to, say, financial time series for a year, then move on to machine learning for another year, you are likely to miss connections between econometric models and machine learning models. Having said that, it is not a pleasant feeling to incorporate varied practice into one’s schedule. Imagine you have just about understood how to build a particle filter, a technique for online estimation of the state vector, and you have already spent quite an amount of time on that subject. It is time to move to another area, say building a Levy-process-based model. As soon as you start working on Levy processes, you sense that your knowledge of Poisson and renewal processes is very rusty and the learning is extremely slow. This is the unpleasant part. But the authors have a reassuring message: when the learning appears slow and effortful, that is where real learning is taking place.

Compared to massed practice, a significant advantage of interleaved practice is that it helps us learn to assess context and discriminate between problems, selecting and applying the right solution from a range of possibilities. The authors give the example of learning painting styles to drive home the point of interleaved and varied practice.

Practice like you play and you will play like you practice. The authors stress the importance of simulations for better practice. If you are into trading strategy development, simulating time series and testing the strategy out of sample is fundamental for a better understanding of the strategy. In fact, with the rise of MCMC, the very process of estimation and model selection is done via simulation. The authors also bring up an example where daily reflection serves as a form of retrieval practice.

I liked the last section of this chapter, where the authors share the story of a Georgia university football coach who follows the principles of retrieval, spacing, interleaving, variation, reflection, and elaboration in making his college team a better playing team.


Embrace Difficulties

Short-term impediments that make for stronger learning have come to be called “desirable difficulties,” a term coined by Elizabeth and Robert Bjork. The chapter starts with an example of a military school where the trainees are not allowed to carry notebooks or write things down; they have to listen, watch, rehearse, and execute. Testing is a potent reality check on the accuracy of your own judgment of what you know how to do. The process of strengthening long-term memory is called consolidation, and consolidation and the transition of learning to long-term storage occur over a period of time. An apt analogy for how the brain consolidates new learning is the experience of composing an essay. Let’s say you are studying point processes, a class of stochastic processes. The first time around, you might not be able to appreciate all the salient points of the text; you start out feeling disorganized, and the most important aspects are not salient. Consolidation and retrieval help solidify these learnings. If you practice over and over again in some rapid-fire fashion, you are leaning on short-term memory and very little mental effort is needed. There is an instant improvement, but the improvement is not robust enough to sustain. But if you practice by spacing and interleaving, the learning is much deeper and you will retrieve it far more easily in the future.

Durable, robust learning means we do two things. First, as we recode and consolidate new material from short-term memory into long-term memory, we must anchor it there securely. Second, we must associate it with a diverse set of cues that will make us adept at recalling the material later. Having effective retrieval cues is essential to learning, and that is where tools like mindmaps help a lot. The reason we don’t remember stuff is that we don’t practice and apply it. If you are into building, say, math/stat models, it is essential to at least simulate some dataset and build a toy model, so that the practice gets some kind of anchorage for retrieval. Without this, any reading of a model will stay in your working memory for some time and then vanish. Knowledge, learning, and skills that are vivid, that hold significance, and that are practiced periodically stay with us. Our retrieval capacity is limited and is determined by context, by recent use, and by the number and vividness of the cues that you have linked to the knowledge and can call on to bring it forth.

Psychologists have uncovered a curious inverse relationship between the ease of retrieval practice and the power of that practice to entrench learning: the easier a piece of knowledge or a skill is for you to retrieve, the less your retrieval practice will benefit your retention of it.

There is an excellent case study of a baseball team that is split into two groups given different practice regimens. The first group practices 45 pitches evenly divided into three sets, each set consisting of a single type of pitch. The second group also practices 45 pitches, but this time the pitch types are randomly interspersed. After the training, the first group feels good about its practice, while the second group feels it was not developing its skills properly. However, when it comes to the final performance test, the second group performs far better than the first. This story illustrates two points. First, our judgments of which learning strategies work best for us are often mistaken, colored by illusions of mastery. Second, some difficulties that require more effort and slow down apparent gains will feel less productive, but will more than compensate for that by making the learning stronger, more precise, and more durable. The more you have forgotten a topic, the more effective relearning will be in shaping your permanent knowledge. The authors also highlight that if you struggle to solve a problem before being shown how to solve it, the subsequent lesson is better learned and more durably remembered.

This chapter, and this book, is an amazing fountainhead of ideas one can use. Not everything is new, but the fact that there is empirical evidence to back it up means you know it is not folklore wisdom. One thing I learnt from the book, which has reinforced my way of learning, is “write to learn”. After reading a book or a concept, I try to write it down so that I can relate it to things I have already learnt, to aspects of the field where I eventually want to apply it, and so on. This obviously takes up a lot of time, but the learning is far more robust. I think the book summaries I manage to write are one of the best ways to reflect on the main contents of a book. I tend to write a pretty detailed summary of the key ideas, so that the summary serves as material for retrieval practice at a later point in time.

The other idea this chapter talks about is the need to commit errors to solidify learning. I came to know about the “Festival of Errors” and the “Fail conference”. There is also a story about Bonnie, a writer and self-taught ornamental gardener, who follows the philosophy “leap before you look, because if you look, you probably won’t like what you see”. Her garden writing appears under the name “Blundering Gardener”. Bonnie is a successful writer, and her story shows that struggling with a problem makes for stronger learning, and that a sustained commitment to advancing in a particular field by trial and error leads to complex mastery and greater knowledge of the interrelationships of things. Bonnie’s story is pretty inspiring for anyone who wishes to tackle a difficult field. By going headlong into the field, learning from the trial-and-error process, and then writing about the entertaining snafus and unexpected insights, she is doing two things: retrieving the details and elaborating on them. “Generative learning” means the learner generates the answer rather than recalling it; basically, learning via trial and error.


  • Learning is a three step process – Initial encoding, consolidation and retrieval
  • Ability to recall what you already know depends on repeated use of the information and powerful retrieval cues
  • Retrieval practice that’s easy does little to strengthen learning; the more difficult the practice, the greater the benefit
  • Retrieval needs to be spaced. When you recall something from your memory when it has already become rusty, you need more effort and this effortful retrieval strengthens memory and makes learning pliable
  • Practice needs to be interleaved and varied
  • Trying to come up with an answer, rather than having it presented to you, leads to better learning and retention

Avoid Illusions of knowing

The chapter starts by describing the two modes of thinking, System 1 and System 2, from Daniel Kahneman’s book, and says that we base our actions on System 1 more often than System 2. Our inclination toward narratives has a significant effect on our memory capabilities. There are a lot of illusions and misjudgments that we carry along. One way to escape them is to replace subjective experience as the basis for decisions with a set of objective gauges outside ourselves, so that our judgment squares with the real world around us. When we have reliable reference points, we can make good decisions about where to focus our efforts, recognize where we’ve lost our bearings, and find our way again. It is important to pay attention to the cues you are using to judge what you have learned. Whether something feels familiar or fluent is not always a reliable indicator of learning. Neither is your level of ease in retrieving a fact or phrase on a quiz shortly after meeting it in the text. Far better is to create a mental model of the material that integrates the various ideas of the text, connects them to what you already know, and enables you to draw inferences. How ably you can explain the text is an excellent cue for judging comprehension, because you must recall the salient points, put them in your own words, and explain how they connect to everything else.

Get beyond your learning styles

It is a common statement in the media: “every kid is different; learning has to be specific, catering to the kid’s learning style.” On the face of it, the statement looks obvious. Empirical evidence, however, does not support it. The authors give a laundry list of all the learning styles that have been put forth and say there is absolutely no evidence that catering to an individual learning style makes any difference. The simple fact that different theories embrace such wildly discrepant dimensions gives cause for concern about their scientific underpinnings. While it’s true that most of us have a decided preference for how we like to learn new material, the premise behind learning styles is that we learn better when the mode of presentation matches the particular style in which an individual is best able to learn. That is the critical claim.

The authors say that

Moreover, their review showed that it is more important that the mode of instruction match the nature of the subject being taught: visual instruction for geometry and geography, verbal instruction for poetry, and so on. When instructional style matches the nature of the content, all learners learn better, regardless of their differing preferences for how the material is taught.

So, if learning styles don’t matter, how should one go about learning? The authors mention two aspects here:

  1. Structure building: There do appear to be cognitive differences in how we learn, though not the ones recommended by advocates of learning styles. One of these differences is the idea mentioned earlier that psychologists call structure building: the act, as we encounter new material, of extracting the salient ideas and constructing a coherent mental framework out of them. These frameworks are sometimes called mental models or mental maps. High structure- builders learn new material better than low structure-builders.
  2. Successful intelligence: Go wide: don’t roost in a pigeonhole of your preferred learning style but take command of your resources and tap all of your “intelligences” to master the knowledge or skill you want to possess. Describe what you want to know, do, or accomplish. Then list the competencies required, what you need to learn, and where you can find the knowledge or skill. Then go get it. Consider your expertise to be in a state of continuing development, practice dynamic testing as a learning strategy to discover your weaknesses, and focus on improving yourself in those areas. It’s smart to build on your strengths, but you will become ever more competent and versatile if you also use testing and trial and error to continue to improve in the areas where your knowledge or performance are not pulling their weight.


Increase your abilities

This chapter starts off with some famous examples, such as the popular marshmallow study and memory athletes, to drive home the point that the brain is ever changing. This obviously means that the authors take the side of nurture in the nature vs. nurture debate. The brain is remarkably plastic, to use the term applied in neuroscience, even into old age for most people. But the brain is not a muscle, so strengthening one skill does not automatically strengthen others. Learning and memory strategies such as retrieval practice and the building of mental models are effective for enhancing intellectual abilities in the material or skills practiced, but the benefits don’t extend to mastery of other material or skills. Studies of the brains of experts show enhanced myelination of the axons related to the area of expertise, but not elsewhere in the brain; the myelination changes observed in piano virtuosos are specific to piano virtuosity. The ability to make practice a habit, however, is generalizable. To the extent that “brain training” improves one’s efficacy and self-confidence, as the purveyors claim, the benefits are more likely the fruits of better habits, such as learning how to focus attention and persist at practice.

After an elaborate discussion on IQ, the authors suggest three strategies to amp up performance levels:

  1. Maintaining a growth mindset — Carol Dweck’s work is used as the supporting argument. Dweck came to see that some students aim at performance goals, while others strive toward learning goals. In the first case, you’re working to validate your ability. In the second, you’re working to acquire new knowledge or skills. People with performance goals unconsciously limit their potential. If your focus is on validating or showing off your ability, you pick challenges you are confident you can meet. You want to look smart, so you do the same stunt over and over again. But if your goal is to increase your ability, you pick ever-increasing challenges, and you interpret setbacks as useful information that helps you to sharpen your focus, get more creative, and work harder.
  2. Deliberate Practice – Well, this has become a common term after many authors have written journalistic accounts of Anders Ericsson’s research. In essence it means that expert performance in medicine, science, music, chess, or sports has been shown to be the product not just of innate gifts, as had long been thought, but of skills laid down layer by layer, through thousands of hours of dedicated practice.
  3. Memory cues – Until a learner develops a deep learning of a subject, he/she can resort to mnemonic devices. Conscious mnemonic devices can help to organize and cue the learning for ready retrieval until sustained, deliberate practice and repeated use form the deeper encoding and subconscious mastery that characterizes expert performance.

It comes down to the simple but no less profound truth that effortful learning changes the brain, building new connections and capability. This single fact— that our intellectual abilities are not fixed from birth but are, to a considerable degree, ours to shape— is a resounding answer to the nagging voice that too often asks us “Why bother?” We make the effort because the effort itself extends the boundaries of our abilities. What we do shapes who we become and what we’re capable of doing. The more we do, the more we can do. To embrace this principle and reap its benefits is to be sustained through life by a growth mindset. And it comes down to the simple fact that the path to complex mastery or expert performance does not necessarily start from exceptional genes, but it most certainly entails self-discipline, grit, and persistence; with these qualities in healthy measure, if you want to become an expert, you probably can. And whatever you are striving to master, whether it’s a poem you wrote for a friend’s birthday, the concept of classical conditioning in psychology, or the second violin part in Haydn’s Fifth Symphony, conscious mnemonic devices can help to organize and cue the learning for ready retrieval until sustained, deliberate practice and repeated use form the deeper encoding and subconscious mastery that characterize expert performance.


Make It Stick

The authors implement the lessons of the book within the confines of the book itself: this chapter is a spaced repetition of all the ideas mentioned in the previous chapters. So, if you don’t care about the empirical evidence, you can read just this chapter, take the ideas at face value, incorporate them into your schedule, and see if they make sense.


This is by far the best book I have read on the question of “how to go about learning something?”. There are gems in this book that any learner can incorporate into their schedule and see a drastic change in their learning effectiveness. The book is a goldmine for students, teachers and life-long learners. I wish this book had been published when I was a student!


I liked Daniel Coyle’s “Talent Code”, which talks about the importance of “deep practice” in achieving mastery in any field. I liked it not for the message of deep practice, which has already been repeated in many books and articles, but for the varied examples in the book.

Here comes another book along the same lines by the same author. This book is a collection of thoughts and ideas from the author’s field work, packaged as “TIPS” to improve one’s skillset. These tips are grouped into three categories: “Getting Started”, “Improving Skills”, and “Sustaining Progress”.

I will list some of the tips from each section, mostly from the perspective of someone wanting to improve their programming skills.

Getting Started:

  • Spend fifteen minutes a day engraving the skill on your brain: In these days of abundant online instructional video content, one can easily watch an ace programmer demonstrating his hacking skills. Maybe watching such videos daily will engrave the skill on the brain. One can also think of revisiting codekata from time to time.
  • Steal without apology: Copy others’ code and improvise. The latter part, learning and improvising, is the crux. When you steal, focus on specifics, not general impressions. Capture concrete facts. For me, reading Hadley Wickham’s code provides far more feedback than any debugger in the world.
  • Buy a notebook: I need to do this and track my errors on a daily basis.
  • Choose Spartan over Luxurious: Spot on.
  • Hard Skill or a Soft Skill: I think a skill like programming falls somewhere in between. You’ve got to be consistent and stick to the basic fundas of a language. But at the same time, you’ve got to constantly read, recognize and react to situations to become a better programmer.
  • To build hard skills, work like a careful carpenter: Learning the fundamentals of any language may seem boring. When the novelty wears off, after the first few programs, a programmer needs to keep going, learning bit by bit, paying attention to errors and fixing them.
  • Honor the hard skills: Prioritize the hard skills because in the long run they’re more important to your talent. As they say, for most of us there is no instant gratification in understanding a math technique. However, you’ve got to trust that it will be damn useful in the long run.
  • Don’t fall for the Prodigy Myth: In one of Paul Graham’s essays, he mentions a girl who for some reason thinks that being a doctor is cool and grows up to become one. Sadly, she never enjoys her work and bemoans that she is the outcome of a 12-year-old’s fancy. So, in a way, one should savor obscurity while it lasts. Those are the moments when you can really experiment with stuff, fall, get up and learn things.

Improving Skills:

  • Pay attention to Sweet Spots: If there are no sweet spots in your work, your practice is really not useful. The book describes the sweet-spot sensation as, “Frustration, difficulty, alertness to errors. You’re fully engaged in an intense struggle— as if you’re stretching with all your might for a nearly unreachable goal, brushing it with your fingertips, then reaching again.” As melodramatic as it might sound, some version of that description should be happening in your work on a regular basis.
  • Take off your watch: Now this is a different message from what everybody else seems to be saying (the 10,000-hour rule). The author says, “Deep practice is not measured in minutes or hours, but in the number of high-quality reaches and repetitions you make— basically, how many new connections you form in your brain.” Maybe he is right. Instead of counting how many hours you spent doing something, you should count how many programs you have written, or something to that effect. Personally, I feel it is better to keep track of the time spent. Atul Gawande, in his book “Better”, talks about top-notch doctors who all have one thing in common: they track the things they care about, be it the number of operations, the number of failures, etc. You have got to count something, have some metric that summarizes your daily work.
  • Chunking: This is particularly relevant for a programmer. The smaller the code fragment, the more effectively and elegantly it works.
  • Embrace Struggle: Most of us instinctively avoid struggle, because it’s uncomfortable. It feels like failure. However, when it comes to developing your talent, struggle isn’t an option— it’s a biological necessity.
  • Choose five minutes a day over an hour a week: With deep practice, small daily practice “snacks” are more effective than once-a-week practice binges. How very true.
  • Practice Alone: I couldn’t agree more.
  • Slow it down (even slower than you think): This is a difficult thing to do, as we programmers are always in a hurry. Seldom do we look back at code that works!
  • Invent Daily Tests: Think of something new daily and code it up. Easier said than done, but I guess that’s what makes the difference between a good programmer and a great one.
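The “count reaches and repetitions, not hours” idea above can be sketched as a tiny practice log. This is purely my own illustration, not something from the book; the class name, metric labels and dates are all made up:

```python
from collections import Counter
from datetime import date

class PracticeLog:
    """A minimal daily log: count the reaches and repetitions you care
    about (programs written, errors fixed) rather than hours on the clock."""

    def __init__(self):
        # keyed by (day, metric) so daily and overall tallies both work
        self.counts = Counter()

    def record(self, metric, n=1, day=None):
        day = day or date.today().isoformat()
        self.counts[(day, metric)] += n

    def total(self, metric):
        # overall tally of one metric across all days
        return sum(v for (d, m), v in self.counts.items() if m == metric)

log = PracticeLog()
log.record("programs_written", 2, day="2011-08-01")
log.record("errors_fixed", 3, day="2011-08-01")
log.record("programs_written", 1, day="2011-08-02")
print(log.total("programs_written"))  # → 3
```

The point of the sketch is only that the unit being counted is a concrete piece of work, not elapsed time.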

Sustaining Progress:

  • To learn it more deeply, teach it: I would say at least blog about it.
  • Embrace Repetition: It seems obvious at first, but how many of us actively seek out repetition in our lives? We always seem to want novelty!
  • Think like a gardener, work like a carpenter: Think patiently, without judgment. Work steadily, strategically, knowing that each piece connects to a larger whole. We all want to improve our skills quickly— today, if not sooner. But the truth is, talent grows slowly.

Out of the 52 tips mentioned in the book, I am certain that at least a few will resonate with anyone who is serious about improving their skills.


This book contains most of the productivity hacks that one comes across in various articles, blogs and books. In one sense, it is a laundry list of hacks that one can try out to increase productivity. A big font size for the text and rich images scattered throughout make it a coffee table book.

Some of the hacks that I found interesting are,

  • On a daily basis, try to use pen and paper for at least a few minutes to work something out, be it a math problem, a back-of-the-envelope calculation, or simply drawing images that capture whatever you are working on. Writing makes one’s relation to work intensely personal, more so with pen and paper.
  • Keep a Swipe file – Your swipe file should contain good ideas and examples from your field of work or interest and from other fields. This hack is spot on for a programmer. You’ve got to keep a folder that lists tasks alongside the most efficient code you have figured out for each.
  • Force an image or word association with a number and vice versa. This little hack is very useful for remembering stuff.
  • Declare a MAD – Massive Action Day: Pick a day and focus on only one task. Switch off the TV, mobile, RSS alerts, email alerts, the Internet, etc., and work on something for 8-10 hours at a stretch.
  • Keep an “IDEAS BOX” – Twyla Tharp mentions this as a key to her successful career.
  • Make a NOT-TO-DO list. We always seem to know what we need to do, but we don’t care or think about what not to do in a day.
  • Use time pods – a 45-minute period where you focus on only one task, along the lines of the Pomodoro technique, which uses 25-minute slots.
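As a toy illustration of the time-pod idea, here is a small script that lays out a day of 45-minute pods separated by short breaks. The function name and the 10-minute break length are my own assumptions, not from the book:

```python
from datetime import datetime, timedelta

def plan_time_pods(start, pods, pod_minutes=45, break_minutes=10):
    """Return a list of (pod_start, pod_end) pairs: single-task focus
    blocks of pod_minutes each, separated by break_minutes breaks."""
    schedule = []
    current = start
    for _ in range(pods):
        end = current + timedelta(minutes=pod_minutes)
        schedule.append((current, end))
        current = end + timedelta(minutes=break_minutes)
    return schedule

if __name__ == "__main__":
    day = plan_time_pods(datetime(2011, 9, 1, 9, 0), pods=4)
    for s, e in day:
        print(s.strftime("%H:%M"), "-", e.strftime("%H:%M"))
    # first pod: 09:00 - 09:45
```

Swapping `pod_minutes=25` gives the classic Pomodoro layout instead.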


This book by Michael Lewis delves into the reasons behind the mysterious success of the Oakland Athletics, one of the poorest teams in Major League Baseball. In a game where players are bought at unbelievable prices, where winning or losing is a matter of who has the bigger financial muscle, the Oakland A’s go on to make baseball history with rejected players and rookies.

“Is their winning streak a result of random luck, or is there a secret behind it?” is the question that Michael Lewis tries to answer. Like a sleuth, the author investigates the system and the person behind it, Oakland A’s general manager Billy Beane.

The author traces the life of Billy Beane from his college days to the time he becomes general manager of the Oakland A’s. Billy was one of the most promising players of his day, and everyone believed he was a superstar in the making. To everyone’s surprise, however, Billy doesn’t make it. He quits playing and takes up a desk job with the Oakland A’s, in the operation responsible for picking talent and managing the team. Scouts, as they are called, are the people who draft young and promising players into the team. In the course of time, Billy realizes that the methods followed by scouts are subjective, gut-feel based, touchy-feely criteria. Even some of the broad metrics used to rate players appear vague to Billy. He sets out on a mission to create metrics that truly reflect the value of players. Billy relies on statistics and metrics to shortlist players, and he does so in a manner similar to a futures and options trader.

Much before Black-Scholes showed that an option can be replicated using fractional units of stock and bonds, option prices were well above fair value. However, once traders had a rough picture of fair value, the difference was arbitraged away. In the same way, the players belonging to financially rich teams are, in a sense, overvalued. Billy systematically chips away at the broad metrics and tries to find ways to replicate these overvalued players with new, undervalued players, rookies who fit his metric criteria. So, if you think about the book from a finance perspective, Billy Beane is a classic arbitrage trader who shorts overvalued players and longs undervalued ones. In fact, the way Billy operates, selling and buying players, is similar to any trading desk operation. He starts off the season with really average players and, as the season proceeds, evaluates the various players available to trade, develops aggressive buy/sell strategies (for players), and creates a team that statistically has a higher chance of winning. I found the book very engrossing, as it is a splendid account of Billy Beane’s method of diligently creating a process that is system-dependent rather than people-dependent.

This book is being adapted into a movie starring Brad Pitt as Billy Beane, slated to be released this September. If the movie manages to replicate even 30-40% of the pace and content of the book, I bet it will be a runaway hit.
