Technology


[Cover image: Graph Databases]

Neo4j’s founder Emil Eifrem shares a bit of history about the way Neo4j was started. Way back in 1999, his team realized that the database they were using internally had a lot of connections between discrete data elements. Like many successful companies that grow out of a founder’s frustration with the status quo, Neo4j began its life in the founding team’s frustration with the fundamental design of relational databases. The team started experimenting with various data models centered around graphs. Much to their dismay, they found no readily available graph database to store their connected data. Thus began the team’s journey into building a graph database from scratch. Project Neo was born. What’s behind the name Neo4j? The 4 in Neo4j does not stand for a version number; all version numbers are appended after the word Neo4j. I found one folksy explanation on Stack Overflow that goes like this:

The Neo series of databases was developed in Sweden and attracted the ‘j’ suffix with the release of version 4 of the graph database. The ‘j’ is from the word ‘jätteträd’, literally "giant tree", and was used to indicate the huge data structures that could now be stored.

Incidentally the ‘neo’ portion of the name is a nod to a Swedish pop artist Linus Ingelsbo, who goes by the name NEO. In Neo1 the example graph was singers and bands and featured the favourite artists of the developers, including NEO. This was changed to the movie data set in the version 4 release.

Other people speculate that Neo refers to "The Matrix" character Neo, fighting the "relational tables". It was recently announced that Neo4j would be called Neo5j as part of the latest branding exercise. In one of the recent blog posts from the company, it was said that the j in Neo4j stood for Java, as the original graph database was written as a Java library.

Preface

The preface explains the purpose of the book, i.e. to introduce graphs and graph databases to technology practitioners, including developers, database professionals, and technology decision makers. It also explains the main changes in the content of this book as compared to the first edition. The changes are mainly in the areas of Cypher syntax and modeling guidelines.

Introduction

A graph is nothing but a collection of edges and vertices. Having said that, there are many different ways in which a graph can be stored. One of the most popular graph models is the labeled property graph, which is characterized as follows (a small sketch follows this list):

  • It contains nodes and relationships
  • Nodes contain properties (key-value pairs)
  • Nodes can be labeled with one or more labels
  • Relationships are named and directed, and always have a start and end node
  • Relationships can also contain properties
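As a minimal sketch of the labeled property graph model (the labels, property keys and relationship type below are invented for illustration, not taken from the book), a few lines of Cypher capture all of the above:

    CREATE (alice:Person {name: 'Alice', age: 32})
    CREATE (acme:Company {name: 'Acme'})
    CREATE (alice)-[:WORKS_AT {since: 2015}]->(acme)

Here Person and Company are labels, name, age and since are properties, and WORKS_AT is a named, directed relationship that itself carries a property.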

Rough Classification

Given the plethora of products in this space, it is better to have some sort of binning. The chapter bins the products into two groups.

  1. Technologies used primarily for transactional online graph persistence, typically accessed directly in real time from an application
  2. Technologies used primarily for offline graph analytics, typically performed as a series of batch steps

Broader Classification

The chapter also says that one of the other ways to slice the graph space is via graph data models, i.e. property graphs, RDF triples and hypergraphs. In the appendix of the book, there is a much broader classification given for the NOSQL family of products. I will summarize the contents of the appendix in this section as I found it very well written.

Rise of NOSQL:

The NOSQL movement has mainly arisen because of 3 V’s

  • Volume of data : There has been a deluge of semi-structured data in the recent decade, and storing it all in a structured relational format has been fraught with difficulties. Storing connections gives rise to complicated join queries for CRUD operations.
  • Velocity of data reads, writes and schema changes
  • Variety of data

Relational databases are known to provide ACID transactions: Atomic, Consistent, Isolated, Durable. NOSQL databases instead have BASE properties: Basic availability, Soft state, Eventual consistency. Basic availability means that the data store appears to work most of the time. Soft-state stores don’t have to be write-consistent, nor do replicas have to be mutually consistent all the time. Eventually consistent stores exhibit consistency at some later point.

Types of NOSQL Databases:

NOSQL databases can be divided into four types:

  • Document Stores : Document databases store and retrieve documents, just like an electronic filing cabinet. These stores act like a key-value store with some sort of indexing in place for quicker retrieval. Since the data model is one of disconnected entities, these stores tend to scale horizontally. Writes are atomic at the document level, and the technology has not matured for writes across multiple documents. MongoDB and CouchDB are two examples of document stores.
  • Key-Value Stores : Key-value stores are cousins of the document store family, but their lineage comes from Amazon's Dynamo database. They act like large, distributed hashmap data structures that store and retrieve opaque values by key. A client stores a data element by hashing a domain-specific key. The hash function is crafted such that it provides a uniform distribution across the available buckets, thereby ensuring that no single machine becomes a hotspot. Given the hashed key, the client can use that address to store the value in a corresponding bucket. Because the values are opaque, these stores offer less insight into the data than document stores, although Riak, for instance, does offer visibility into certain types of structured data. In any case, these stores are optimized for high availability and scale. Riak and Redis are two examples of key-value stores.
  • Column Family : These stores are modeled on Google's BigTable. The data model is based on a sparsely populated table whose rows can contain arbitrary columns, the keys of which provide natural indexing. The four building blocks of a column-family store are the column, super column, column family and super column family. HBase is an example of a column-oriented database.
  • Graph Databases : All three previous types of databases are still aggregate stores. Querying them for insight into data at scale requires processing by some external application infrastructure, and aggregate stores are not built to deal with highly connected data. This is where graph databases step in. A graph database is an online, operational database management system with CRUD methods that expose a graph model. Graph databases are generally built for use with transactional systems. There are two properties that are useful to understand while investigating graph technologies:
    • The underlying storage: Some graph databases use "native graph storage", which is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage; some serialize the graph data into a relational database, an object-oriented database, etc.
    • Processing engine: Some graph databases are capable of index-free adjacency, i.e. nodes point directly to each other in the database. For graphs that use index-free adjacency, the authors use the term "native graph processing".

    Besides adopting a specific approach to storage and processing, graph databases also adopt a specific model. There are various models such as property graphs, hypergraphs and triples. Hypergraphs are mainly useful for representing many-to-many relationships. Triple stores typically provide SPARQL capabilities to reason about and store RDF data. Triple stores generally do not support index-free adjacency and are not optimized for storing property graphs. To perform graph queries, triple stores must create connected structures from independent facts, which adds latency to each query. Hence the sweet spot for a triple store is analytics, where latency is a secondary consideration, rather than OLTP.

Power of Graph Databases:

The power of graph databases lies in performance, agility and flexibility. Performance comes from the fact that a search can be localized to a portion of the graph. Flexibility comes from the fact that there is little impedance between the way the business communicates its requirements and the way the graph is modeled. Agility comes from the fact that graph databases are schema-less and hence new connections can easily be accommodated.

Options for Storing Connected Data

This chapter takes a simple example of modeling friends and friend-of-friends relations via an RDBMS. SQL code is given to answer a few basic questions a user might have while looking at a social graph. The queries, even for simple questions, become very complex. It quickly becomes obvious to anyone reading this chapter that modeling connections via an RDBMS is challenging, for the following reasons:

  • Join tables add accidental complexity
  • Foreign key constraints add additional development and maintenance overhead
  • Sparse tables with nullable columns require special checking in code
  • Several expensive joins are needed to discover nested relationships
  • Reverse queries are difficult to execute as the SQL code becomes extremely complicated to write and maintain

NOSQL databases are also not well suited for storing connected data. By the very nature of document stores, key-value stores and column-family stores, the lack of connections or relationships as first-class objects makes it difficult to achieve scale. Even though there is some sort of indexing available in most NOSQL databases, they do not have the index-free adjacency feature. This implies that there is latency in querying connected data.

The chapter ends with showcasing a graph database to store connections, and it is abundantly clear by this point in the chapter that, by giving relationships the status of first-class objects, graph databases make it extremely easy to represent and maintain connected data. Unlike RDBMS and NOSQL stores, in which connected data forces developers to write data-processing logic in the application layer, graph databases offer a convenient and powerful alternative.
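To make the contrast concrete, here is a hedged Cypher sketch of the friend-of-friend query (the Person label, FRIEND_OF relationship type and name property are illustrative assumptions, not the book's exact schema). Where the relational version needs repeated self-joins on a join table, the graph version is a single pattern:

    // friends of Alice's friends, excluding Alice and her direct friends
    MATCH (alice:Person {name: 'Alice'})-[:FRIEND_OF]-(friend)-[:FRIEND_OF]-(fof)
    WHERE fof <> alice AND NOT (alice)-[:FRIEND_OF]-(fof)
    RETURN DISTINCT fof.name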

Data Modeling with Graphs

The path from a real-life representation of data to a data model in the graph database world is short. In fact, one can use almost the same lingo to talk about various aspects of the graph data. This is unlike the RDBMS world, where one needs to translate the real-life representation into a logical model via the "normalization-denormalization" route. In the latter world, there is a semantic dissonance between the conceptualization of a model and the database’s instantiation of that model.

The chapter begins by introducing Cypher, an expressive graph database query language. Cypher is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in accord with the way we intuitively describe graphs using diagrams. Like most other query languages, Cypher is composed of clauses. The chapter introduces some of the main clauses such as MATCH, RETURN, WHERE, CREATE, MERGE, DELETE, SET, FOREACH, UNION, WITH, START.
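As a flavour of how these clauses read together, here is a minimal sketch (the Person and Movie labels and their properties are invented for illustration):

    MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
    WHERE movie.released > 2000
    RETURN actor.name, count(movie) AS films
    ORDER BY films DESC
    LIMIT 5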

There are two examples used in the chapter that serve to show the contrast between the steps involved in modeling data in the relational world and in the graph world. In the relational world, one takes the white-board model, creates an ER diagram, then a normalized representation of the model, and finally a denormalized form. Creating a normalized representation involves incorporating foreign-key relationships to avoid redundancy. Once the database model has been adapted to a specific domain, there is additional denormalization work to be done so as to suit the database and not the user. This entire sequence of steps creates an impedance between the real-life representation and the database representation. One of the justifications given for this effort is that it is a one-time investment that will pay off as the data grows. Sadly, given that data has the 3 V’s in the current age, i.e. Volume, Velocity and Variety, this argument no longer holds. It is often seen that RDBMS data models undergo migrations as data structures change; what once seemed to be a solid top-down robust approach falls apart quickly. In the graph world, the effort put into translating a white-board model into a database model is minimal: what you sketch on the whiteboard is what you store in the database. These two examples illustrate the fact that the conceptual-to-implementation dissonance is far less for a graph database than for an RDBMS.

The chapter also gives some guidelines for creating graphs so that they match up with the BASE properties of NOSQL databases. By the end of the chapter, a reader ends up with a good understanding of the Cypher query language. If one is a complete newbie to Neo4j, it might be better to take a pause here and experiment with Cypher before going ahead with the rest of the book.

Building a Graph Database application

The basic guidelines for data modeling are

  • Use nodes to represent entities, that is, the things in our domain that are interesting to us
  • Use relationships to represent the relations between entities and to establish a semantic context for each entity
  • Use relationship direction to further clarify relationship semantics. Many relationships are asymmetric, and hence it makes sense to always represent a relationship with a direction
  • Use node attributes to represent entity attributes, plus any entity metadata
  • Use relationship attributes to represent the strength, weight, quality of the relationship, etc.
  • While modeling a node, make sure that it is a genuine entity and not something that merely connects two other nodes; the latter is better modeled as a relationship
  • It is often useful to represent both a fine-grained and a coarse-grained relationship between two nodes. This helps in quickly querying at the coarse-grained level (see the sketch after this list)
  • The key idea of using a graph as a way to model the data is that nodes and relationships can be modeled iteratively
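A hedged sketch of the fine-grained versus coarse-grained guideline (the Email node and the SENT, TO and EMAILED relationship types are assumptions made up for illustration): the fine-grained structure keeps the detail, while the coarse relationship supports cheap "who contacted whom" traversals.

    // fine-grained: the email itself is modeled as a node
    CREATE (alice:Person {name: 'Alice'})-[:SENT]->(e:Email {subject: 'Hi'}),
           (e)-[:TO]->(bob:Person {name: 'Bob'}),
           // coarse-grained shortcut for fast higher-level queries
           (alice)-[:EMAILED]->(bob)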

Application Architecture

There are two ways in which Neo4j can be used in an application. One is embedded mode, where Neo4j runs inside the application; the other is server mode, where the application talks to Neo4j over its REST API.

Clustering

Writes to Neo4j can go to the master or to a slave. Having this capability at the application level is an extremely powerful feature, and these kinds of features are not present in other graph databases.

Load Balancing

There is no native load balancing available in Neo4j; it is up to the network infrastructure to provide it. One needs to introduce a network-level load balancer to separate out the read and write queries. In fact, Neo4j exposes an API call to indicate whether a server is the master or a slave.

Since cache partitioning is an NP-hard problem, one can use a technique called "cache sharding" to route each query to the server that has the highest probability of having the relevant part of the graph in its cache. The way to route and cache queries depends on the domain in which the graph database is being developed.

Testing

To debug a function written in any language, one needs sample input. In the case of testing a graph, one needs to create a small graph of a few dozen nodes and relationships, so that this localized graph can be used to check for any anomalies in the graph model. It is always better to write a lot of simple tests checking various parts of the graph than to rely on one single universal test for the entire graph. As the graph evolves, an entire regression suite of tests can be built up.

Importing and Bulk Loading Data

Most deployments of any kind of database require some sort of content to be imported into the database. This section of the chapter talks about importing data from CSV files. The headers in the CSV file should be specified in such a way that they reflect whether the pertinent column is an ID, a LABEL, a TYPE or a property. Once the CSV files are in the relevant format, they can be imported via the neo4j-import command. One can also do a batch import via Cypher queries using LOAD CSV. In my line of work, I deal with a lot of RDF input files, and there is an open source stored procedure that one can use to import any large RDF file into Neo4j.

Usually a bulk import of any file should be preceded by an index-creation step. Indexes make lookups faster during and after the load process. Creating indexes on any label is quite straightforward via Cypher commands. If the index is only helpful during the data-loading process, one can delete it after the bulk load is completed. For large datasets, it is suggested that one use the PERIODIC COMMIT clause so that transactions are committed after a certain number of rows have been processed.
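A minimal sketch of such a load, assuming a hypothetical people.csv file with email and name columns (the file name, label and properties are illustrative, not from the book):

    // index first, so that MERGE lookups during the load stay fast
    CREATE INDEX ON :Person(email);

    USING PERIODIC COMMIT 1000
    LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
    MERGE (p:Person {email: row.email})
    SET p.name = row.name;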

Graphs in the Real World

Why Organizations choose Graph Databases?

Some of the reasons that organizations choose graph databases are :

  • "Minutes to Millisecond" performance requirement – Using an index free adjacency, a graph database turns complex joins into fast graph traversals, thereby maintaining millisecond performance irrespective of the overall size of the dataset
  • Drastically accelerated development cycles : Graph model reduces the impedance mismatch between the technical and the business domains
  • Extreme business responsiveness- Schema-free nature of the graph database coupled with the ability to simultaneously relate data elements in lots of different ways allows a graph database solution evolve as the business needs evolve.
  • Enterprise ready -The technology must be robust, scalable and transactional. There are graph databases in the market that provide all the -ilities, i.e transactionability, high-availability, horizontal read scalability and storage of billions of triples

Common Use Cases

The chapter mentions the following as some of the most common use cases in the graph database world:

  1. Social Networks Modeling
  2. Recommendations (Inductive and Suggestive)
  3. Geospatial data
  4. Master Data Management
  5. Network and Data Center Management
  6. Authorization and Access Control

Graphs in the real world

There are three use cases that are covered in detail. Reading through these use cases gives one a good introduction to various data modeling aspects, including writing good Cypher queries. Reading through the Cypher queries presupposes that the reader is slightly comfortable with Cypher. To get the best learning out of these examples, it is better to write out the Cypher queries yourself before looking at the constructed queries; one gets immediate feedback about the way to construct efficient Cypher queries.

Graph Database Internals

This chapter gives a detailed description of Neo4j graph database internals.

[Figure: Neo4j architecture]

Native Graph Processing

A graph database has native processing capabilities if it exhibits a property called index-free adjacency. A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. This enables query times to be independent of the total graph size, depending only on the part of the graph that is searched.

A non-native graph database uses global indexes to link nodes. These indexes add a layer of indirection to each traversal, thereby incurring computational cost. To traverse a network of m steps, the cost of the indexed approach is O(m log n), whereas the cost is O(m) for an implementation that uses index-free adjacency. To provide index-free adjacency, it is important to create an architecture that supports this property. Neo4j has painstakingly built this over a decade, and one can see the result of their efforts by querying a large graph. In my work, I have dealt with pure triple stores and with global indexes built on top of triple stores. The performance of Neo4j is a zillion times better than that of the databases I have worked on. Hence I am a big fan of Neo4j.

Native Graph Storage

Neo4j stores data in a number of different store files. There are separate data files for nodes, relationships and properties. The node store is a fixed-size record store; each node record is 9 bytes in length. The node ID is created in such a way that node lookup is O(1) instead of O(log n). The constituents of the node record are 1) the ID of the first relationship, 2) the ID of the first property and 3) the label store of the node. Relationships are also stored in fixed-length records whose constituents are 1) the start node ID, 2) the end node ID and 3) pointers to the previous and next relationships.

With fixed-size records, traversals are implemented simply by chasing pointers around a data structure. The basic idea of searching in Neo4j boils down to locating the first record in the relationship chain and then chasing various pointers. Apart from this structure, Neo4j also has a caching layer that further increases its performance.

Programmatic APIs

Neo4j exposes three types of API for a developer:

  1. Kernel API
  2. Core API
  3. Traversal API

The Core API allows developers to fine-tune their queries so that they exhibit high affinity with the underlying graph. A well-written Core API query is often faster than any other approach. The downside is that such queries can be verbose, requiring considerable developer effort. Moreover, their high affinity with the underlying graph makes them tightly coupled to its structure; when the graph structure changes, they can often break. Cypher can be more tolerant of structural changes, since things such as variable-length paths help mitigate variation and change. The Traversal Framework is both more loosely coupled than the Core API (because it allows the developer to declare informational goals) and less verbose, and as a result a query written using the Traversal Framework typically requires less developer effort than the equivalent written using the Core API. Because it is a general-purpose framework, however, the Traversal Framework tends to perform marginally less well than a well-written Core API query.

Nonfunctional characteristics

Evaluation

One of the ways to evaluate the performance of a database is via 1) the number of transactions that can be handled in an ACID way and 2) the number of read and write queries that can be processed by the database.

Transactions in Neo4j are semantically identical to traditional database transactions. Writes occur within the same transaction context, with write locks being taken for consistency purposes on any nodes and relationships involved in the transaction. The following excerpt describes the way transactions are implemented:

The transaction implementation in Neo4j is conceptually straightforward. Each transaction is represented as an in-memory object whose state represents writes to the database. This object is supported by a lock manager, which applies write locks to nodes and relationships as they are created, updated, and deleted. On transaction rollback, the transaction object is discarded and the write locks released, whereas on successful completion the transaction is committed to disk.

Scale

For a platform like Neo4j, one cannot talk of scale purely in terms of the number of transactions per second. One needs to think about scale along at least three different axes.

  • Capacity: The current release of Neo4j can support single graphs having tens of billions of nodes, relationships, and properties. The Neo4j team has publicly expressed the intention to support 100B+ nodes/relationships/properties in a single graph as part of its road map
  • Latency: The architecture of Neo4j makes performance almost constant irrespective of the size of the graph. Most queries follow a pattern whereby an index is used to find a starting node and the remainder is pointer chasing. Performance depends not on the size of the graph but on the amount of data being queried.
  • Throughput: Neo4j has constant-time performance irrespective of the graph size. We might think that extensive reads and writes will make the performance of the entire database go down. However, typical read and write queries operate on a localized portion of the graph, and hence there is scope to optimize at an application level.

Holy grail of scalability

The future goal of most graph databases is to be able to partition a graph across multiple machines without application-level intervention, so that read and write access to the graph can be scaled horizontally. In the general case this is known to be an NP-hard problem, and thus impractical to solve optimally.

Predictive Analysis with Graph Theory

The chapter starts off with a basic description of depth-first and breadth-first graph traversal mechanisms. Most useful algorithms aren’t pure breadth-first but are informed to some extent. Breadth-first search underpins numerous classical graph algorithms, including Dijkstra’s algorithm. The last part of the chapter talks about several graph-theoretical concepts that can be used to analyze networks.

Takeaway

The book gives a good introduction to various aspects of a generic graph database, even though most of the content is Neo4j specific. The book is 200 pages long, but it packs in so much content that one needs to read slowly to thoroughly understand the various aspects of graph databases. If you have been exposed to native triple stores, then this book will be massively useful for getting an idea about Neo4j’s implementation of the property-graph model, an architecture that makes all the CRUD operations on a graph database very efficient.


[Cover image: Learning Cypher]

Cypher is a query language for the Neo4j graph database. The basic model in Neo4j can be described as follows:

  • Each node can have a number of relationships with other nodes
  • Each relationship goes from one node either to another node or to the same node
  • Both nodes and relationships can have properties, and each property has a name and a value

Cypher was first introduced in Nov 2013 and since then the popularity of graph databases as a category has taken off. The following visual shows the pivotal moment:

[Figure: the rise in popularity of graph databases]

Following the popularity of Cypher, the language itself was opened up in October 2015 via the openCypher project. The rationale behind the decision was that a common query syntax could be followed across graph databases. Cypher provides a declarative syntax, which is readable and powerful, and with which a rich set of graph patterns can be expressed.

Via Neo4j’s blog:

Cypher is the closest thing to drawing on a white board with a keyboard. Graph databases are whiteboard friendly; Cypher makes them keyboard friendly.

Given that Cypher has become open source and has the potential to become the de facto standard in the graph database segment, it becomes important for anyone working with graph data to have familiarity with the syntax. Since the syntax looks like SQL and has a somewhat Pythonic element to query formulation, it can be picked up easily by reading a few articles about it. Do you really need a book for it? Not necessarily. Having said that, this book reads like a long tutorial and is not dense. It might be worth one’s time to read this book to get a nice tour of various aspects of Cypher.

Chapter 1 : Querying Neo4j effectively with Pattern Matching

Querying a graph database using an API is usually very tedious. I have had this experience first hand while working on a graph database that had ONLY an API interface for obtaining graph data. SPARQL is a relief in such situations, but SPARQL has a learning curve. I would not call it steep, but the syntax is a little different, and one needs to get used to thinking in triples to write effective SPARQL queries. Writing effective SPARQL queries entails thinking in subject-predicate-object terms. Cypher, on the other hand, is a declarative query language, i.e. it focuses on the aspects of the result rather than on methods or ways to get the result. It is also human-readable and expressive.

The first part of the chapter starts with instructions to set up a new Neo4j instance. The Neo4j server can be run as a standalone machine with the client making API calls, or it can be run as an embedded component in an application. For learning purposes, working with a standalone server is the most convenient option, as you have a ready console to test out sample queries. The second part of the chapter introduces a few key elements of Cypher (illustrated in the sketch after this list), such as

  • MATCH
  • RETURN
  • () for nodes
  • [] for relations
  • -> for directions
  • -- for matching relationships in either direction (no direction specified)
  • Filtering matches via specifying node labels and properties
  • Filtering relationships via specifying relationship labels and properties
  • OPTIONAL to match optional paths
  • Assigning the entire paths to a variable
  • Passing parameters to Cypher queries
  • Using built in functions such as allShortestPaths
  • Matching paths that connect nodes via a variable number of hops
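A hedged sketch pulling several of these elements together (the User label, KNOWS relationship type and parameter names are invented for illustration):

    // $name and $other are parameters passed alongside the query
    MATCH (u:User {name: $name})
    OPTIONAL MATCH path = (u)-[:KNOWS*1..3]->(other:User)
    RETURN u.name, other.name, length(path);

    MATCH p = allShortestPaths((a:User {name: $name})-[:KNOWS*]-(b:User {name: $other}))
    RETURN nodes(p);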

Chapter 2 : Filter, Aggregate and Combine Results

This chapter introduces several Cypher constructs that can be used to extract summary statistics of various nodes and relationships in a graph. The following are the Cypher keywords explained in this chapter (a short sketch follows the list):

  • WHERE for text and value comparisons
  • IN to filter based on certain values
  • The "item identifier IN collection WHERE rule" pattern that can be used to work with collections. This pattern is similar to a list comprehension in Python
  • LIMIT and SKIP for pagination purposes. The examples do not use ORDER BY, which is crucial for obtaining stable paginated results
  • ORDER BY for sorting results
  • The COALESCE function to work around null values
  • COUNT(*) and COUNT(property value); the subtle difference between the two is highlighted
  • Aggregate functions like MIN, MAX, AVG
  • COLLECT to gather all the values of properties along a certain path pattern
  • The CASE WHEN ELSE pattern for conditional expressions
  • WITH to separate query parts
  • UNION and UNION ALL
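A hedged example combining several of these keywords (the schema is invented for illustration):

    MATCH (p:Person)-[:RATED]->(m:Movie)
    WHERE m.year IN [2012, 2013, 2014]
    WITH m, count(*) AS ratings, avg(p.age) AS avgAge
    RETURN m.title,
           ratings,
           COALESCE(avgAge, 0) AS avgRaterAge,
           CASE WHEN ratings > 100 THEN 'popular' ELSE 'niche' END AS bucket
    ORDER BY ratings DESC
    SKIP 0 LIMIT 10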

Chapter 3 : Manipulating the Database

This chapter talks about create, update and delete operations on nodes and relationships. The Cypher keywords explained in the chapter are listed below (a short sketch follows the list):

  • CREATE, used to create nodes, relationships and paths
  • CREATE UNIQUE
  • SET for changing properties and labels
  • MERGE to check for an existing pattern and create the pattern if it does not exist in the database
  • ON MATCH SET and ON CREATE SET for setting properties during merge operations
  • REMOVE for removing properties and labels
  • DELETE
  • The FOREACH pattern to loop through the nodes in a path
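A hedged CRUD sketch using these clauses (labels, properties and relationship types are illustrative assumptions):

    // create-or-update a node, recording when it was first or last seen
    MERGE (p:Person {email: 'alice@example.com'})
      ON CREATE SET p.created = timestamp()
      ON MATCH SET p.lastSeen = timestamp()
    SET p.name = 'Alice'
    REMOVE p.tempFlag
    RETURN p;

    // mark every node on a path, then delete a relationship
    MATCH path = (a:Person {name: 'Alice'})-[:KNOWS*..2]->(b:Person)
    FOREACH (n IN nodes(path) | SET n.visited = true);

    MATCH (a:Person {name: 'Alice'})-[r:KNOWS]->(:Person {name: 'Bob'})
    DELETE r;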

By the end of this chapter, any reader should be fairly comfortable in executing CRUD queries. The queries comprise three phases

  1. READ : This is the phase where you read data from the graph using the MATCH and OPTIONAL MATCH clauses
  2. WRITE : This is the phase where you modify the graph using CREATE, MERGE, SET and all other clauses
  3. RETURN : This is the phase where you choose what to return to the caller

Improving Performance

This chapter mentions the following guidelines for creating queries in Neo4j (a short sketch follows the list):

  • Use parameterized queries: wherever possible, write queries with parameters, which allows the engine to reuse execution plans. This takes advantage of the fact that the Neo4j engine can cache queries
  • Avoid unnecessary clauses such as DISTINCT when the background knowledge of the graph data makes them redundant
  • Use direction wherever possible in MATCH clauses
  • Use a specific depth value while searching for variable-length paths
  • Profile queries so that the server does not get inundated by inefficiently constructed queries
  • Whenever there is a large number of nodes belonging to a certain label, it is better to create an index. In fact, while importing a large RDF file it is always better to create indexes on certain types of nodes
  • Use constraints if you are worried about property redundancy
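A hedged sketch of a few of these guidelines in practice (the label, property and parameter names are made up for illustration):

    // schema: an index for lookups and a uniqueness constraint
    CREATE INDEX ON :Person(name);
    CREATE CONSTRAINT ON (p:Person) ASSERT p.email IS UNIQUE;

    // parameterized, directed, depth-bounded query, inspected with PROFILE
    PROFILE
    MATCH (p:Person {name: $name})-[:KNOWS*1..2]->(friend:Person)
    RETURN DISTINCT friend.name;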

Chapter 4 : Migrating from SQL

The chapter talks about the various tasks involved in migrating data from an RDBMS to a graph database. There are three main tasks in migrating from SQL to a graph database:

  1. Migrating the schema from RDBMS to Neo4j
  2. Migrating the data from tables to Neo4j
  3. Migrating queries to let your application continue working

It is better to start with an ER diagram that is close to the white-board representation of the data. Since a graph database can represent the white-board model more closely than the table-structure mess (primary keys, foreign keys, cardinality), one can quickly figure out the nodes and relationships needed for the graph data. For migrating the actual data, one needs to export the data into relevant CSV files and load those CSVs into Neo4j. The structure of the various CSV files to be generated depends on the labels, nodes and relationships of the graph database schema. Migrating queries from the RDBMS world into the graph database world is far easier, as Cypher is declarative; it is much quicker to code the various business-requirement queries using Cypher syntax.
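As an illustration of the query-migration step (the Customer and Order labels and the PLACED relationship type are invented for this sketch), a query that in SQL would join a customers table to an orders table via a foreign key becomes a single pattern in Cypher:

    MATCH (c:Customer {customerId: $id})-[:PLACED]->(o:Order)
    RETURN c.name, o.orderId, o.total
    ORDER BY o.total DESC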

Chapter 5 : Operators and Functions

The last section of the book contains a laundry list of operators and functions that one can use in creating a Cypher query. It is more like a cheat sheet, but with elaborate explanations of the various Cypher keywords.

Takeaway

This book gives a quick introduction to all the relevant keywords needed to construct a Cypher query. In fact it is fair to describe the contents of the book as a long tutorial with sections and subsections that can quickly bring a Cypher novice up to speed.

[Cover image: The Master Algorithm]

This book gives a macro picture of machine learning. In this post, I will briefly summarize the main points of the book. One can think of this post as a meta-summary, as the book itself is a summary of all the main areas of machine learning.

 

Prologue

Machine learning is all around us, embedded in technologies and devices that we use in our daily lives. These technologies are so integrated with our lives that we often do not even pause to appreciate their power. Well, whether we appreciate it or not, there are companies harnessing the power of ML and profiting from it. So the question arises: do we need to care about ML at all? When a technology becomes so pervasive, you need to understand it a bit. You can’t control what you don’t understand. Hence, at least from that perspective, having a general overview of the technologies involved matters.

Machine learning is something new under the sun: a technology that builds itself. The artifact in ML is referred to as a learning algorithm. Humans have always been designing artifacts, whether hand built or mass produced. But learning algorithms are artifacts that design other artifacts. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer’s needs. But instead of turning stone into masonry or gold into jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.

At its core, Machine learning is about prediction: predicting what we want, the results of our actions, how to achieve our goals, how the world will change.

The author says that he has two goals in writing this book:

  • Provide a conceptual model of the field, i.e. rough knowledge so that you can use it effectively. There are many learning algorithms out there, and many are being invented every year. The book provides an overview of learning algos by categorizing the people who use them. The author calls each category a tribe. Each tribe has its own master algorithm for prediction
    • Symbolists: They view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic.
      • Master algorithm for this tribe is inverse deduction
    • Connectionists: They reverse engineer the brain and are inspired by neuroscience and physics.
      • Master algorithm for this tribe is backpropagation
    • Evolutionaries: They simulate evolution on the computer and draw on genetics and evolutionary biology.
      • Master algorithm for this tribe is genetic programming
    • Bayesians: They believe that learning is a form of probabilistic inference and have their roots in statistics.
      • Master algorithm for this tribe is Bayesian inference
    • Analogizers: They learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
      • Master algorithm for this tribe is Support vector machines.

In practice, each of the master algorithms that a particular tribe uses is good for some types of problems but not for others. What we really want is a single algorithm combining the key features of all of them: The Master Algorithm.

  • Enable the reader to invent the master algorithm. A layman, approaching the forest from a distance, is in some ways better placed than the specialist already deeply immersed in the study of particular trees. The author suggests that the reader pay attention to each tribe and get a general overview of what each tribe does and what tools each tribe uses. By viewing each tribe as a piece of a puzzle, it becomes easy to get conceptual clarity of the entire field as a whole.

The Machine Learning Revolution

An algorithm is a sequence of instructions telling a computer what to do. The simplest algorithm is: flip a switch. The second simplest algorithm is: combine two bits. The idea connecting switching circuits and reasoning was understood by Shannon in his master’s thesis; he later went on to found a most important scientific discipline, information theory. Computers are all about logic; flipping a set of transistors is all that any algorithm does. However, behind this benign activity, some of the most powerful algorithms go about doing their work by using preexisting algos as building blocks. So, if computing power increases, do all the algos automatically become efficient and all powerful? Not necessarily. The serpent in the garden goes by the name "complexity monster", and there are many heads to this complexity monster.

  • Space complexity : the number of bits of info that an algo needs to store in the computer’s memory
  • Time complexity : how long the algo takes to run
  • Human complexity : when algos become too complex, humans cannot control them and any errors in the algo execution causes panic.

Every algorithm has an input and an output: the data goes into the computer, the algo does its job and gives out the result. Machine learning turns this around: in go the data and the desired result, and out comes the algorithm that turns one into the other. Learning algorithms, or learners, are algorithms that make other algorithms. With machine learning, computers write their own programs. The author uses a nice low-tech analogy to explain the power of machine learning:

Humans have known a way to get some of the things they need by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. The promise of machine learning is that the technology can mirror farming. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

This analogy makes two things immediate. First, the more data we have, the more we can learn. Second, ML is a sword that can slay the complexity monster. If one looks at any learning algorithm, it does broadly one of two things: it either learns knowledge or learns skills, i.e. it learns knowledge in the form of statistical models or learns procedures that underlie a skill. The author offers another analogy from the information ecosystem. He equates databases, crawlers, indexers and so on with herbivores, patiently munching on endless fields of data. Statistical algos and online analytical processing are the predators, and learning algorithms are the super-predators. Predators turn data into information, and super-predators turn information into knowledge.

Should one go and spend years in the computer science discipline to become an ML expert? Not necessarily. CS makes you think deterministically, whereas ML needs probabilistic thinking. The difference in thinking is a large part of why Microsoft has had a lot more trouble catching up with Google than it did with Netscape. A browser is just a standard piece of software, but a search engine requires a different mind-set.

If you look at all the fantastic stuff behind e-commerce applications, it is all about matchmaking: producers of information are being connected to consumers of information. In this context, learning algorithms are the matchmakers. The arsenal of learning algos will serve as a key differentiator between companies. Data is indeed the new oil.

The chapter ends with a discussion of how political campaigning is being revolutionized by ML. With truckloads of data at the individual voter level, politicians are turning to ML experts to help them with campaign analysis and directed targeting. The author predicts that, in the future, machine learning will cause more elections to be close. What he means is that learning algorithms will be the ultimate retail politicians.

 

The Master Algorithm

If you look at algorithms such as nearest neighbor, decision trees or naive Bayes, they are all domain agnostic. This warrants a question: "Can there be one learner that does everything, across all domains?" If you estimate a parameter in, let’s say, a frequentist way and subsequently check it with Bayesian inference, then both approaches give almost the same distribution if there is a LOT of data. Under a deluge of data, you will most certainly get a convergence of parameters in a model, irrespective of the model used. However, the reality is that there is a scarcity of data, which necessitates making assumptions. Depending on the type of problem and the type of assumption, certain classes of learning models are better than the rest. If one speculates about the presence of a master algorithm, then the assumptions also need to go in as input. This chapter explores the presence of a master algorithm that can learn from the input data, the output data and the assumptions. The author grandly states the central hypothesis of the book as:

All knowledge-past, present, and future-can be derived from data by a single, universal learning algorithm.

The author speculates the presence of a master algorithm and says that there are arguments from many fields that speculate the presence of one.

The argument from Neuroscience : The evidence from the brain suggests that it uses the same learning algorithm, throughout, with the areas dedicated to the different senses distinguished only by the different inputs they are connected to. In turn, the associative areas acquire their function by being connected to multiple sensory regions, and the "executive" areas acquire theirs by connecting the associative areas and motor output.

If you look at any part of the cortex in a human brain, you find that the wiring pattern is similar. The cortex is organized into six layers, with feedback loops, short-range inhibitory connections and long-range excitatory connections. This pattern is very much common across the brain. There is some variation in the patterns, but the structure is pretty much the same. The analogy is that it is the same algo with different parameters and settings. Low-level motor skills are controlled by the cerebellum, which has a clearly different and regular architecture. However, experiments have shown that the cerebellum's function can be taken over by the cortex. So, this again suggests that there is some "master algorithm" controlling the entire brain.

If you look at a set of learning algos in the ML field, you can infer that at some level they are trying to reverse engineer the brain’s function. One of the five ML tribes, Connectionists, believe in this way of modeling the world.

The argument from Evolution : If one looks at the evolution of species since the beginning of earth, one can think of natural selection or whatever the nature does as an algorithm. This algorithm has all the species as inputs and the species at any given point in time as the output. The master algo does the work of eliminating certain species, allowing certain species to mutate, etc. It is a dynamic process where the outputs are again fed as inputs. This line of thought makes one speculate the presence of a master algorithm. In fact one of the five tribes in the ML world, Evolutionaries, strongly believe in this way of modeling.

The argument from Physics : Most of physics is driven by simple equations that prune away all the noise in the data and focus on the underlying beauty. Physics laws discovered in one domain are seamlessly applied to other domains. If everything we see in nature can be explained by a few simple laws, then it makes sense that a single algorithm can induce all that can be induced. All the Master Algorithm has to do is provide a shortcut to the laws’ consequences, replacing impossibly long mathematical derivations with much shorter ones based on actual observations. Another way to look at scientific disciplines is to think of their various laws and states as outcomes of a dynamic optimization problem. However, physics is unique in its simplicity; outside physics, mathematics is only reasonably effective.

Machine learning is what you get when the unreasonable effectiveness of mathematics meets the unreasonable effectiveness of data.

The argument from Statistics : Bayesians look at the world with a probabilistic and learning mindset. Bayes’ rule is a recipe for turning data into knowledge. In the yesteryears, Bayesian applications were confined to simple problems. However, with the rise of computing power, Bayesian methods are now being used in a wide range of complex modeling situations. Is Bayes the master algorithm? Well, there seem to be many critics of the Bayesian approach to modeling. All said and done, it definitely appears that Bayesian inference will be a part of the "Master Algorithm" in some way.

The argument from computer science : The author mentions the famous unsolved problem in computer science, P vs. NP. NP-complete problems are a set of problems that are equivalent in their computational hardness; if you solve one of them, you solve them all. For decades, many mathematicians, researchers and scientists have been finding clever tricks to almost solve NP-complete problems. But the fundamental question still eludes us: is the class of problems we can efficiently solve the same as the class of problems for which we can efficiently check whether a solution exists? If you read up on this problem, you will realize that all NP-complete problems can be reduced to a satisfiability problem. If we invent a learner that can learn to solve satisfiability, it has a good claim to being the Master Algorithm.

NP-completeness aside, the sheer fact that a computer can do a gazillion tasks should make one confident about speculating on the presence of a master algorithm that works across several problems. The author uses the example of the Turing machine and says that, back then, it was unthinkable to actually see such a universal machine in action. A Turing machine can solve every conceivable problem that can be solved by logical deduction. The fact that we see these machines everywhere means that, despite the odds, we might see a Master Algorithm sometime in the future.

The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.

Interesting analogy : The author is of the opinion that "human intuition" can’t replace data. There have been many instances where human intuition has gone terribly wrong and a guy with lots of data has done better. The author uses Brahe, Kepler and Newton’s work to draw a parallel to machine learning.

Science goes through three phases, which we can call the Brahe, Kepler, and Newton phases. In the Brahe phase, we gather lots of data, like Tycho Brahe patiently recording the positions of the planets night after night, year after year. In the Kepler phase, we fit empirical laws to the data, like Kepler did to the planets’ motions. In the Newton phase, we discover the deeper truths. Most science consists of Brahe-and-Kepler-like work; Newton moments are rare. Today, big data does the work of billions of Brahes, and machine learning the work of millions of Keplers. If-let’s hope so-there are more Newton moments to be had, they are as likely to come from tomorrow’s learning algorithms as from tomorrow’s even more overwhelmed scientists, or at least from a combination of the two.

Critics of the Master Algo : Well, for a concept as ambitious as the Master Algorithm, there are bound to be critics, and there are many. The author mentions a few of them as examples:

  • Knowledge engineers
  • Marvin Minsky
  • Noam Chomsky
  • Jerry Fodor

Hedgehog or Fox : One of the other questions that comes up when we think of the "Master Algorithm" is whether it is a fox or a hedgehog. There are many studies showing that being a fox is far better than being a hedgehog. The hedgehog is synonymous with an "expert". In the context of this book, though, a learning algorithm can be considered a "hedgehog" if variations of it can solve all learning problems. The author hopes that the "Master Algorithm" turns out to be a hedgehog.

Five tribes of ML : In the quest for the master algorithm, we do not have to start from scratch. There are already many decades of ML research underway, and each type of research community is akin to a tribe. The author describes five tribes, i.e. symbolists, connectionists, Bayesians, evolutionaries and analogizers. Each of the tribes uses its own master algo. Here is an illustration from the author’s presentation:

[Figures from the author's presentation: the five tribes of ML and their master algorithms]

But the real master algo that the author is hinting at is an algo that combines all the features of the tribes.

The most important feature of each tribe is that its members firmly believe that theirs is the only way to model and predict. Unfortunately, this thinking hinders their ability to model a broad set of problems. For example, a Bayesian would find it extremely difficult to leave the probabilistic inference method and look at the problem from an evolutionary point of view; his thinking is forged in terms of priors, posteriors and likelihood functions. If a Bayesian were to look at an evolutionary algorithm like a genetic algo, he might not critically analyze it and adapt it to the problem at hand. This limitation is prevalent across all tribes. Analogizers love support vector machines, but they are limited because they look for similarities of inputs across various dimensions, i.e. they are bound to be hit by the curse of dimensionality. The same serpent, "the curse of dimensionality", that the author talks about in the previous chapters comes and bites each tribe, depending on the type of problem being solved.

The obvious question that arises in a reader’s mind is: can tribes be combined to solve a specific set of problems? Indeed, the tribe categorization is not a hard categorization of the algorithms. It is just meant as a starting point so that you can place the gamut of algos in separate buckets.

 

Hume’s problem of Induction

The chapter starts with a discussion of "Rationalism vs. Empiricism". The rationalist likes to plan everything in advance before making the first move; the empiricist prefers to try things and see how they turn out. There are philosophers who strongly believe in one and not in the other. From a practical standpoint, there have been productive contributions to our world from both camps. David Hume is considered to be one of the greatest empiricists of all time. In the context of machine learning, one of his questions has hung like a sword of Damocles over all of knowledge:

How can we ever be justified in generalizing from what we’ve seen to what we haven’t?

The author uses a simple example where you have to decide whether or not to ask someone out for a date. The dataset used in the example illustrates Hume’s problem of induction, i.e. there is no reason to pick one generalization over another. So, a safe way out of the problem, at least to begin with, is to assume that the future will be like the past. Is this enough? Not really. In the ML context, the real problem is: how to generalize to cases that we haven’t seen before. One might think that by amassing huge datasets you can solve this problem. However, once you do the math, you realize that you will run out of data that covers all the cases needed to carry the inductive argument safely. Each new data point is most likely unique, and you have no choice but to generalize. According to Hume, there is no way to do it.

If this all sounds a bit abstract, suppose you’re a major e-mail provider, and you need to label each incoming e-mail as spam or not spam. You may have a database of a trillion past e-mails, each already labeled as spam or not, but that won’t save you, since the chances that every new e-mail will be an exact copy of a previous one are just about zero. You have no choice but to try to figure out at a more general level what distinguishes spam from non-spam. And, according to Hume, there’s no way to do that.

The "no free lunch" theorem : If you have been reading some general articles in the media on ML and big data, it is likely that you would have come across a view on the following lines:

With enough data, ML can churn out the best learning algo. You don’t have to have strong priors, the fact that you have large data is going to give you all the power to understand and model the world.

The author introduces David Wolpert’s "no free lunch" theorem, which places a limit on how good a learning algorithm can be. The theorem says that, averaged over all possible worlds, no learner can be better than random guessing. Are you surprised by this theorem? Here is how one can reconcile oneself to it:

Pick your favorite learner. For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount. All I have to do is flip the labels of all unseen instances. Since the labels of the observed ones agree, there’s no way your learner can distinguish between the world and the antiworld. On average over the two, it’s as good as random guessing. And therefore, on average over all possible worlds, pairing each world with its antiworld, your learner is equivalent to flipping coins.

How does one escape the random-guessing limit? By caring only about the world we live in and not about alternate worlds. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. What are the implications of the "no free lunch" theorem for our modeling world?

There’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.

Unwritten rule of Machine learning : The author states that a principle laid out by Newton in his work Principia serves as the first unwritten rule of ML:

Whatever is true of everything we’ve seen is true of everything in the universe.

Newton’s principle is only the first step, however. We still need to figure out what is true of everything we’ve seen, i.e. how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner’s job is to flesh it out. One of the ways to think about creating such a form is via "conjunctive concepts", i.e. a series of statements joined by AND. The problem with conjunctive concepts is that they are practically useless. The real world is driven by "disjunctive concepts", i.e. concepts defined by a set of rules. One of the pioneers in this approach of discovering rules was Ryszard Michalski, a Polish computer scientist. After immigrating to the United States in 1970, he went on to found the symbolist school of machine learning, along with Tom Mitchell and Jaime Carbonell.

Overfitting and Underfitting : The author uses the words "hallucination" and "blindness" to describe overfitting and underfitting. With a huge space of hypotheses, you can almost certainly overfit the data; on the other hand, with a sparse hypothesis set, you can fail to see the true patterns in the data. This classic problem is mitigated by doing out-of-sample testing. Is that good enough? Well, that’s the best that is available without going into muddy philosophical debates or alternative pessimistic approaches like that of Leslie Valiant (author of Probably Approximately Correct).

Induction as inverse of deduction : Symbolists work via the induction route and formulate an elaborate set of rules. Since this route is computationally intensive for large datasets, the symbolists prefer something like decision trees. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. How do we then decide which concept the instance belongs to?

Decision trees are used in many different fields. In machine learning, they grew out of work in psychology. Earl Hunt and colleagues used them in the 1960s to model how humans acquire new concepts, and one of Hunt’s graduate students, J. Ross Quinlan, later tried using them for
chess. His original goal was to predict the outcome of king-rook versus king-knight endgames from the board positions. From those humble beginnings, decision trees have grown to be, according to surveys, the most widely used machine-learning algorithm. It’s not hard to see why: they’re easy to understand, fast to learn, and usually quite accurate without too much tweaking. Quinlan is the most prominent researcher in the symbolist school. An unflappable, down-to-earth Australian, he made decision trees the gold standard in classification by dint of relentlessly improving them year after year, and writing beautifully clear papers about them. Whatever you want to predict, there’s a good chance someone has used a decision tree for it.
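
To make the idea concrete, here is a minimal scikit-learn sketch of learning a decision tree; the Iris dataset and the max_depth setting are my own illustrative choices, not examples from the book.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_depth keeps the tree small and readable; deeper trees fit more
    # detail but risk overfitting.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train, y_train)

    print("held-out accuracy:", tree.score(X_test, y_test))
    print(export_text(tree))   # the learned rules, printed as nested if/else tests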

The Symbolists : The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate.

Symbolist machine learning is an offshoot of the knowledge engineering school of AI. The use of computers to automatically learn the rules made the work of pioneers like Ryszard Michalski, Tom Mitchell, and Ross Quinlan extremely popular, and since then the field has exploded.

What are the shortcomings of inverse deduction?

  • The number of possible inductions is vast, and unless we stay close to our initial knowledge, it’s easy to get lost in space
  • Inverse deduction is easily confused by noise
  • Real concepts can seldom be concisely defined by a set of rules. They’re not black and white: there’s a large gray area between, say, spam and nonspam. They require weighing and accumulating weak evidence until a clear picture emerges. Diagnosing an illness involves giving more weight to some symptoms than others, and being OK with incomplete evidence. No one has ever succeeded in learning a set of rules that will recognize a cat by looking at the pixels in an image, and probably no one ever will.

An interesting example of a success from the Symbolists is Eve, the computer that discovered a potential malaria drug. There was a flurry of excitement a year ago, when an article titled "Robot Scientist Discovers Potential Malaria Drug" was published in Scientific American. This is the kind of learning that Symbolists are gung-ho about.

 

How does your brain learn ?

This chapter covers the second of the five tribes mentioned in the book, the "Connectionists". Connectionists are highly critical of the way Symbolists work, as they think that describing something via a set of rules is just the tip of the iceberg. There is a lot more going on under the surface that formal reasoning can't see. Let's say you come across the word "love": a Symbolist would associate a rule with such a concept, whereas a Connectionist would associate various parts of the brain with it. In a sense, there is no one-to-one correspondence between concepts and symbols; the correspondence is many-to-many. Each concept is represented by many neurons, and each neuron participates in representing many different concepts. Hebb's rule is the cornerstone of the connectionists. In a non-math way, it says that "neurons that fire together wire together". The other big difference between Symbolists and Connectionists is that the former tribe believes in sequential processing whereas the latter believes in parallel processing.

To get a basic understanding of the key algos used by connectionists, it helps to have a bit of understanding of the way a neuron is structured in our brain. Here is a visual that I picked up from the author's presentation :

neuron

The branches of the neuron connect to other neurons via synapses, and basic learning takes place via these synaptic connections. The first formal model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943. It looked a lot like the logic gates computers are made of. The problem with this model was that it did not learn. It was Frank Rosenblatt who came up with the first model of learning, by giving variable weights to the connections between neurons. The following is a good schematic diagram of the perceptron:

perceptron

This model generated a lot of excitement and ML received a lot of funding for various research projects. However, the excitement was short-lived. Marvin Minsky and a few others published many examples where the perceptron failed to learn; one of the simplest and most damaging examples was the XOR operator. Their critique, the book Perceptrons, was mathematically unimpeachable, searing in its clarity, and disastrous in its effects. Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.
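
Here is a minimal, self-contained sketch (my own toy code, not the book's) of Rosenblatt-style perceptron learning, showing that it converges on a linearly separable function such as OR but can never get XOR right, because no single straight line separates XOR's classes.

    import numpy as np

    def train_perceptron(X, y, epochs=50, lr=0.1):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for xi, target in zip(X, y):
                pred = 1 if xi @ w + b > 0 else 0
                # Update the weights only when the prediction is wrong.
                w += lr * (target - pred) * xi
                b += lr * (target - pred)
        return w, b

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    for name, y in [("OR", np.array([0, 1, 1, 1])), ("XOR", np.array([0, 1, 1, 0]))]:
        w, b = train_perceptron(X, y)
        preds = (X @ w + b > 0).astype(int)
        print(name, "learned correctly:", np.array_equal(preds, y))
        # OR -> True; XOR -> False, no matter how long you train.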

Fast forward to John Hopfield's work on spin glasses, and there was a reincarnation of the perceptron:

Hopfield noticed an interesting similarity between spin glasses and neural networks: an electron’s spin responds to the behavior of its neighbors much like a neuron does. In the electron’s case, it flips up if the weighted sum of the neighbors exceeds a threshold and flips (or
stays) down otherwise. Inspired by this, he defined a type of neural network that evolves over time in the same way that a spin glass does and postulated that the network’s minimum energy states are its memories. Each such state has a "basin of attraction" of initial states that converge to it, and in this way the network can do pattern recognition: for example, if one of the memories is the pattern of black-and-white pixels formed by the digit nine and the network sees a distorted nine, it will converge to the "ideal" one and thereby recognize it. Suddenly, a vast body of physical theory was applicable to machine learning, and a flood of statistical physicists poured into the field, helping it break out of the local minimum it had been stuck in.

The author goes on to describe the "Sigmoid" function and its ubiquitous nature. If you think about the curve for some time, you will find it everywhere. I think the first time I came across this function was in Charles Handy's book, "The Age of Paradox". Sigmoid functions in that book are used to describe phenomena whose growth starts slowly, then becomes explosive, and finally tapers off. If you take the first derivative of the Sigmoid function, you get a classic bell-shaped curve. I think the book "The Age of Paradox" had a chapter with some heavy management advice that went something like – "you need to create another Sigmoid curve in your life before the older Sigmoid curve starts its downfall", or something to that effect. I don't quite recollect the exact idea from Charles Handy's book, but there is a blog post by Bret Simmons, titled The Road to Davy's Bar, that goes into related details.

Well, in the context of ML, the application of the Sigmoid curve is more practical. It can be used to replace the step function, and suddenly things become more tractable. A single neuron can only learn a straight-line boundary, but a set of neurons, i.e. a multi-layer perceptron, can learn far more convoluted boundaries. Agreed, there is a curse of dimensionality here, but if you think about it, the hyperspace explosion is a double-edged sword: on the one hand the objective function is far more wiggly, but on the other hand there is less chance that you will get stuck in a local minimum via gradient search methods. With the Sigmoid activation and the multi-layer tweak, the perceptron came back with a vengeance, and there was a ton of excitement just like the time when the perceptron was first introduced. The algorithm by which the learning takes place is called "backpropagation"; it is a variant of gradient descent and was popularized by David Rumelhart and his colleagues in 1986. There is no mathematical proof that backpropagation will find the global minimum, though. Backprop solves what the author calls the "credit assignment" problem: in a multi-layered perceptron, the error between the target value and the current output needs to be propagated backward across all the layers, so that each layer gets its share of the blame for the error.

Whenever the learner’s "retina" sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two.
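
Here is a minimal numpy sketch of that forward/backward rhythm on a tiny sigmoid network; the XOR data, the four hidden units and the learning rate are my own illustrative choices, and the point is only the mechanics of the backward pass, not a faithful reproduction of anything in the book.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))       # 2 inputs -> 4 hidden units
    W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))       # 4 hidden units -> 1 output

    for step in range(10000):
        # Forward pass: the signal flows input -> hidden -> output.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: output error first, then each hidden unit's share of the
        # blame (the credit-assignment step), using sigmoid'(z) = s * (1 - s).
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # Gradient-descent updates of weights and biases.
        W2 -= 0.5 * h.T @ d_out
        b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
        W1 -= 0.5 * X.T @ d_h
        b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)
        if step % 2500 == 0:
            print("squared error:", round(float(np.mean((out - y) ** 2)), 4))

    # The printed error should shrink as the weights converge.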

Sadly, the excitement petered out, as learning in networks with dozens or hundreds of hidden layers was computationally out of reach. In recent years, though, backpropagation has made a comeback thanks to huge computing power and big data; it now goes by the name "Deep Learning". The key idea of deep learning is based on autoencoders, which are explained very well by the author. However, there are many things that need to be worked out for deep learning to be anywhere close to the Master Algorithm. All said and done, there are a few limitations to exclusively following the connectionist tribe. Firstly, the learned model is difficult to comprehend: it comprises convoluted connections between various neurons. The other limitation is that the approach is not compositional, meaning it is divorced from the way a big part of human cognition works.

 

Evolution : Nature’s Learning Algorithm

The chapter starts with the story of John Holland, the first person to have earned a PhD in computer science, in 1959. Holland is known for his immense contributions to genetic algorithms. His key insight lay in coming up with a fitness function that would assign a score to every program considered. What's the role of the fitness function ? Starting with a population of not-very-fit individuals-possibly completely random ones-the genetic algorithm has to come up with variations that can then be selected according to fitness. How does nature do that? This is where the genetic part of the algorithm comes in. In the same way that DNA encodes an organism as a sequence of base pairs, we can encode a program as a string of bits, and variations are produced by crossovers and mutations. The next breakthrough in the field, genetic programming, came from Holland's student John Koza, who came up with the idea of evolving full-blown computer programs.
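
The fitness/selection/crossover/mutation loop is easy to see in code. Below is a minimal genetic-algorithm sketch; the "maximise the number of 1-bits" task and all the parameter values are my own toy choices, not anything from the book.

    import random

    random.seed(0)
    GENES, POP, GENERATIONS = 40, 60, 60

    def fitness(bits):                       # score each candidate "genome"
        return sum(bits)

    def crossover(a, b):                     # single-point crossover of two parents
        cut = random.randrange(1, GENES)
        return a[:cut] + b[cut:]

    def mutate(bits, rate=0.01):             # occasionally flip a bit
        return [1 - g if random.random() < rate else g for g in bits]

    population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
    for generation in range(GENERATIONS):
        # Keep the fitter half, then breed children from random pairs of survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: POP // 2]
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(POP - len(survivors))]
        population = survivors + children

    print("best fitness after evolution:", fitness(max(population, key=fitness)))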

Genetic programming’s first success, in 1995, was in designing electronic circuits. Starting with a pile of electronic components such as transistors, resistors, and capacitors, Koza’s system reinvented a previously patented design for a low-pass filter, a circuit that can be used for things like enhancing the bass on a dance-music track. Since then he’s made a sport of reinventing patented devices, turning them out by the dozen. The next milestone came in 2005, when the US Patent and Trademark Office awarded a patent to a genetically designed factory optimization system. If the Turing test had been to fool a patent examiner instead of a conversationalist, then January 25, 2005, would have been a date for the history books. Koza’s confidence stands out even in a field not known for its shrinking violets. He sees genetic programming as an invention machine, a silicon Edison for the twenty-first century.

A great mystery in genetic programming that is yet to be resolved conclusively is the role of crossover. None of Holland's theoretical results show that crossover actually helps; mutation suffices to exponentially increase the frequency of the fittest schemas in the population over time. There were other problems with genetic programming as well, which finally made the ML community at large divorce itself from this tribe.

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy. As in the nature versus nurture debate, neither side has the whole answer; the key is figuring out how to combine the two. The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. So, is this it ? Have we stumbled on to the right path for "Master Algorithm" ? Not quite. There are tons of problems with evolutionary algos. Symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do. If we want to learn to diagnose cancer, for example, it’s not enough to say "this is how nature learns; let’s do the same." There’s too much at stake. Errors cost lives. Symbolists dominated the first few decades of cognitive psychology. In the 1980s and 1990s, connectionists held sway, but now Bayesians are on the rise.

 

In the Church of the Reverend Bayes

Persi Diaconis, in his paper titled MCMC Revolution, says that the MCMC technique that came from the Bayesian tribe has revolutionized applied mathematics. Indeed, thanks to high-performance computing, Bayes is now a standard tool in any number cruncher's toolkit. This chapter talks about various types of Bayesian techniques. The basic idea behind Bayes is that it is a systematic and quantified way of updating degrees of belief in the light of new data. You can cast pretty much any problem, irrespective of the size of the data available, as a Bayesian inference problem. Bayes' theorem usually goes by the name "inverse probability" because in real life we know Pr(effect|cause) and we are looking to compute Pr(cause|effect). Bayes' theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy. The main point of conflict between Bayesians and non-Bayesians is the reliance on subjective priors to "turn the Bayesian crank". Using subjective estimates as probabilities is considered a sin by frequentists, for whom everything should be learned from the data.
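
Written out, the "inverse probability" mentioned above is just Bayes' theorem (stated here in general form; the cause/effect wording follows the paragraph above):

    \Pr(\text{cause} \mid \text{effect}) = \frac{\Pr(\text{effect} \mid \text{cause}) \, \Pr(\text{cause})}{\Pr(\text{effect})}

The prior Pr(cause) is where the controversial subjective beliefs enter, the likelihood Pr(effect|cause) is what we typically know about the world, and the posterior Pr(cause|effect) is what we actually want.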

One of the most common variants of Bayesian models is the "Naive Bayes" model, in which the effects are assumed to be independent of one another given the cause. Even though this assumption sounds extremely crazy, there are a ton of areas where Naive Bayes beats sophisticated models. No one is sure who invented the Naive Bayes algorithm. It was mentioned without attribution in a 1973 pattern recognition textbook, but it only took off in the 1990s, when researchers noticed that, surprisingly, it was often more accurate than much more sophisticated learners. Also, if you reflect a bit, you will realize that Naive Bayes is closely related to the perceptron algorithm.
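
A minimal Naive Bayes sketch for the spam example discussed earlier in the review; the tiny "corpus", the add-one smoothing and every name below are my own illustrative choices.

    from collections import Counter
    import math

    spam = ["win cash now", "cheap pills win", "win win lottery"]
    ham = ["meeting at noon", "lunch tomorrow", "project meeting notes"]

    def word_counts(docs):
        return Counter(w for d in docs for w in d.split())

    counts = {"spam": word_counts(spam), "ham": word_counts(ham)}
    totals = {c: sum(counts[c].values()) for c in counts}
    priors = {"spam": len(spam) / 6, "ham": len(ham) / 6}
    vocab = set(counts["spam"]) | set(counts["ham"])

    def log_posterior(doc, c):
        # log Pr(class) plus the sum of log Pr(word | class); the words are treated
        # as independent given the class (the "naive" assumption), with add-one smoothing.
        score = math.log(priors[c])
        for w in doc.split():
            score += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        return score

    print(max(("spam", "ham"), key=lambda c: log_posterior("win cheap lottery now", c)))
    # -> spam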

The author mentions Markov models as the next step in the evolution of Bayesian models. A Markov model applies to a sequence of random variables in which each variable depends only on the current state and is conditionally independent of the rest of the history. Markov chains turn up everywhere and are one of the most intensively studied topics in mathematics, but they're still a very limited kind of probabilistic model. A more complicated model is the Hidden Markov Model, where we don't get to see the actual states but have to infer them from the observations. A continuous version of the HMM goes under the name "Kalman filter", which has been used in many applications across domains.
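
The "depends only on the current state" property is easy to see in a simulation. Here is a minimal two-state Markov chain sketch; the weather states and transition probabilities are invented for illustration.

    import random

    random.seed(0)
    transition = {"sunny": {"sunny": 0.8, "rainy": 0.2},
                  "rainy": {"sunny": 0.4, "rainy": 0.6}}

    state, counts = "sunny", {"sunny": 0, "rainy": 0}
    for _ in range(10000):
        counts[state] += 1
        # The next state is drawn using only the current state's transition row.
        state = random.choices(list(transition[state]),
                               weights=list(transition[state].values()))[0]

    print(counts)   # long-run frequencies approach the chain's stationary distribution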

Naive Bayes, Markov models and Hidden Markov Models are all good, but they are a far cry from what the Symbolists can represent. The next breakthrough came from Judea Pearl, who invented Bayesian networks. These allow one to specify complex dependencies among random variables. By defining the conditional independence of a variable given a set of neighboring nodes, Bayesian networks tame the combinatorial explosion and make inference tractable. Basically, a Bayesian network can be thought of as a "generative model", a recipe for probabilistically generating a state of the world. Despite the complex nature of a Bayesian net, the author mentions that techniques have been developed to successfully infer various aspects of the network. In this context, the author mentions MCMC and gives an intuitive explanation of the technique. A common misconception is that MCMC is a simulation technique. Far from it: the procedure does not simulate any real process; rather, it is an efficient way to generate samples from a Bayesian network. Inference in Bayesian networks is not limited to computing probabilities; it also includes finding the most probable explanation for the evidence. The author uses the "poster child" example of inferring the probability of heads from coin tosses to illustrate the Bayesian technique and compare it with the frequentist world of inference.
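
To make the MCMC idea concrete, here is a minimal Metropolis sampler for the coin-toss example: drawing samples from the posterior over the heads-probability p after observing, say, 7 heads in 10 tosses. The data, the uniform prior and the random-walk step size are my own illustrative choices.

    import math
    import random

    random.seed(0)
    heads, tails = 7, 3

    def log_post(p):
        # Uniform prior on (0, 1) plus a binomial likelihood, on the log scale.
        return heads * math.log(p) + tails * math.log(1 - p)

    p, samples = 0.5, []
    for _ in range(20000):
        proposal = p + random.gauss(0, 0.1)          # random-walk proposal
        if 0 < proposal < 1:
            log_ratio = log_post(proposal) - log_post(p)
            # Metropolis rule: always accept uphill moves, sometimes accept downhill ones.
            if log_ratio >= 0 or random.random() < math.exp(log_ratio):
                p = proposal
        samples.append(p)

    burned_in = samples[5000:]
    print("posterior mean of p:", round(sum(burned_in) / len(burned_in), 3))
    # A frequentist would report the point estimate 7/10; the Bayesian gets a whole
    # distribution, whose mean under a uniform prior is 8/12, roughly 0.67.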

The next set of models that came to dominate the Bayesian tribe is Markov Networks. A Markov network is a set of features and corresponding weights, which together define a probability distribution. Like Bayesian networks, Markov networks can be represented by graphs, but they have undirected arcs instead of arrows. Markov networks are a staple in many areas, such as computer vision. There are many who feel that Markov networks are far better than Naive Bayes, HMMs etc., as they can capture the influence from surroundings.

Bayesians and symbolists agree that prior assumptions are inevitable, but they differ in the kinds of prior knowledge they allow. For Bayesians, knowledge goes in the prior distribution over the structure and parameters of the model. In principle, the parameter prior could be anything we please, but ironically, Bayesians tend to choose uninformative priors (like assigning the same probability to all hypotheses) because they’re easier to compute with. For structure, Bayesian networks provide an intuitive way to incorporate knowledge: draw an arrow from A to B if you think that A directly causes B. But symbolists are much more flexible: you can provide as prior knowledge to your learner anything you can encode in logic, and practically anything can be encoded in logic-provided it’s black and white.

Clearly, we need both logic and probability. Curing cancer is a good example. A Bayesian network can model a single aspect of how cells function, like gene regulation or protein folding, but only logic can put all the pieces together into a coherent picture. On the other hand, logic can't deal with incomplete or noisy information, which is pervasive in experimental biology, but Bayesian networks can handle it with aplomb. Combining connectionism and evolutionism was fairly easy: just evolve the network structure and learn the parameters by backpropagation. But unifying logic and probability is a much harder problem.

 

You are what you resemble

The author introduces the techniques of the "Analogizers" tribe. This tribe uses similarities among data points to categorize them into distinct classes. In some sense, we all learn by analogy: every example that illustrates an abstract concept is like an analogy. We learn by noting the similarity between two concepts and then figuring out what else one can infer, based on the fact that the two concepts are similar.

The chapter begins by explaining the most popular algorithm of the tribe, "the nearest-neighbor algorithm". This was invented way back in 1951 by Evelyn Fix and Joe Hodges. The inventors faced massive difficulty in publishing their algorithm, but the fact that the algo remained unpublished did not faze the many researchers who went about developing variants of it, like the "k-nearest neighbor" method, the "weighted k-nearest neighbor" method, etc. It was in 1967 that Tom Cover and Peter Hart proved that, given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier. This was a momentous revelation. Up until then, all known classifiers assumed that the frontier had a very specific form, typically a straight line. This was a double-edged sword: on the one hand, it made proofs of correctness possible, as in the case of the perceptron, but it also meant that the classifier was strictly limited in what it could learn. Nearest-neighbor was the first algorithm in history that could take advantage of unlimited amounts of data to learn arbitrarily complex concepts. No human being could hope to trace the frontiers it forms in hyperspace from millions of examples, but because of Cover and Hart's proof, we know that they're probably not far off the mark.

Is the nearest-neighbor algo the Master Algorithm ? It isn't, because of the curse of dimensionality: as the number of dimensions (covariates) goes up, the efficiency of the NN algo goes down. In fact, the curse of dimensionality is the second most important stumbling block in machine learning, overfitting being the first. There are certain techniques to handle the dimension explosion, but most of them are hacks and there is no guarantee that they will work.
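
Here is a minimal pure-Python sketch of k-nearest-neighbor classification; the toy 2-D points and labels are invented for illustration.

    from collections import Counter

    data = [((1.0, 1.1), "A"), ((0.9, 0.8), "A"), ((1.2, 0.9), "A"),
            ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.8, 3.0), "B")]

    def knn_predict(query, k=3):
        # Rank the labelled examples by squared distance to the query ...
        dist = lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2
        nearest = sorted(data, key=dist)[:k]
        # ... and let the k closest ones vote on the label.
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    print(knn_predict((1.1, 1.0)))   # -> A
    print(knn_predict((3.0, 3.0)))   # -> B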

Subsequently, the author introduces Support Vector Machines (SVMs), which have become the most popular technique used by the Analogizers. I loved the way the author describes this technique using plain, simple English. He asks the reader to visualize a fat serpent that moves between two countries that are at war. The story of finding the serpent incorporates pretty much all the math that is needed to compute support vectors, i.e.

  • kernel for SVM
  • support vectors
  • weight of the support vectors
  • constrained optimization
  • maximizing the margin of the classifier

My guess is that one would understand the math far more easily after reading through this section on SVMs. SVMs have many advantages and the author highlights most of them. Books such as these also help us verbalize math in simple words. For example, if you were to explain the difference between constrained and unconstrained optimization to a taxi driver, how would you do it? Read this book to check whether your explanation is better than what the author provides.
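
A minimal scikit-learn sketch of the items in the list above: an RBF-kernel SVM, its support vectors, and the margin-maximising, constrained-optimization fit that the solver performs under the hood. The two-moons dataset and the parameter values are my own stand-ins for the serpent-between-two-countries picture.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # C trades off a wide margin against training errors; kernel="rbf" corresponds
    # to the "kernel for SVM" bullet above.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    print("support vectors per class:", clf.n_support_)
    print("training accuracy:", clf.score(X, y))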

Towards the end of the chapter, the author talks about case-based reasoning and says that in the years to come, analogical reasoning will become so powerful that it will sweep through all the fields where case-based reasoning is still employed.

 

Learning without a teacher

Unlike the previous chapters, which focused on labeled data, this chapter is about unsupervised learning. Cognitive scientists describe theories of child learning using algos, and machine learning researchers have developed techniques based on them. The author explains the k-means algorithm, a popular clustering technique. It is actually a special case of the Expectation Maximization (EM) algorithm, which was invented by three Harvard statisticians. EM is used in a ton of places. To learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. Once you have clusters at the macro level, nothing stops you from running the same algo within each cluster to come up with sub-clusters, and so on.
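
A minimal pure-Python k-means sketch on invented 1-D data, showing the same two-step rhythm as EM: assign each point to its nearest centre, then move each centre to the mean of its assigned points.

    data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
    centres = [0.0, 4.0, 10.0]                       # crude initial guesses

    for _ in range(10):
        clusters = {i: [] for i in range(len(centres))}
        for x in data:
            # Assignment step: attach the point to its nearest centre.
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        for i, points in clusters.items():
            # Update step: move each centre to the mean of its points.
            if points:
                centres[i] = sum(points) / len(points)

    print([round(c, 2) for c in centres])            # converges to roughly [1.0, 5.0, 9.0]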

Subsequently, the author introduces another popular technique for unsupervised learning, PCA, which is used for dimensionality reduction. PCA tries to come up with linear combinations of the original dimensions such that the variance of the data captured along those directions is maximized. A step up from this algo is "Isomap", a nonlinear dimensionality reduction technique. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network, and finds the reduced coordinates that best approximate these distances.
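
A minimal numpy sketch of PCA on invented, correlated 2-D data: centre the data, take the eigenvectors of its covariance matrix, and project onto the direction of largest variance.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated data
    X = X - X.mean(axis=0)                          # centre each dimension

    eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
    order = np.argsort(eigvals)[::-1]               # largest variance first
    components = eigvecs[:, order]

    projected = X @ components[:, :1]               # reduce 2-D down to 1-D
    print("share of variance kept by the first component:",
          round(eigvals[order][0] / eigvals.sum(), 3))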

After introducing clustering and dimensional reduction techniques, the author talks about "Reinforcement learning", a technique that relies on immediate response of the environment for various actions of the learner. Research on reinforcement learning started in earnest in the early 1980s, with the work of Rich Sutton and Andy Barto at the University of Massachusetts. They felt that learning depends crucially on interacting with the environment, but supervised algorithms didn’t capture this, and they found inspiration instead in the psychology of animal learning. Sutton went on to become the leading proponent of reinforcement learning. Another key step happened in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children’s learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup that combines neural networks and reinforcement learning is "DeepMind", a company that was acquired by Google for half a billion dollars.
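
Watkins' formulation mentioned above is what is now called Q-learning. Here is a minimal sketch on an invented five-cell corridor where the only reward sits at the right end; the agent learns state-action values purely by interacting with this tiny environment.

    import random

    random.seed(0)
    n_states, actions = 5, [+1, -1]                  # move right or left
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    alpha, gamma, epsilon = 0.5, 0.9, 0.1

    for episode in range(200):
        s = 0
        while s != n_states - 1:                     # an episode ends at the goal cell
            a = random.choice(actions) if random.random() < epsilon \
                else max(actions, key=lambda act: Q[(s, act)])
            s_next = min(max(s + a, 0), n_states - 1)
            reward = 1.0 if s_next == n_states - 1 else 0.0
            # Q-learning update: nudge Q(s, a) toward reward + discounted best future value.
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next

    print([round(max(Q[(s, act)] for act in actions), 2) for s in range(n_states)])
    # Roughly [0.73, 0.81, 0.9, 1.0, 0.0]: values grow as the agent nears the goal
    # (the goal cell itself is terminal, so its value stays zero).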

Another algorithm that has the potential to be a part of the "Master Algorithm" is chunking. Chunking remains a preeminent example of a learning algorithm inspired by psychology. The author gives a basic outline of the concept. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction, but a simpler type of learning by interacting with the environment is: A/B testing. The chapter ends with the author explaining another potentially killer algo, "relational learning".

 

The Pieces of the Puzzle Fall into Place

Progress in science comes from unifying theories: two or more seemingly disparate observations are shown to be driven by the same logic or law. If one looks at ML, it appears that the Master Algorithm would be akin to a unifying theory in science. It would unify the master algorithms and techniques of all the tribes into one cohesive way to learn from data.

In fact, there is already a technique called "meta learning" that some of the tribes use within their own methods. For example, bagging, random forests and boosting are some of the famous meta-learning techniques used by Symbolists. Bayesians have something called "model averaging", which treats each model as a hypothesis and combines the models by weighting each one's vote. Meta learning in its current avatar is remarkably successful, but it's not a very deep way to combine models. It's also expensive, requiring as it does many runs of learning, and the combined models can be quite opaque.

The author used the following schematic diagram for each of the tribes, while explaining the rationale of a possible “Master Algorithm”

puzzle

He then takes the reader on a tour of each tribe's philosophy and master algorithm, and comes up with a unifier, called "Alchemy", which he calls the "Master Algorithm". In the process of creating this master algorithm, he introduces Markov Logic Networks and says that they serve as the representation of the problem. Alchemy uses posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. The author is wary about Alchemy's immediate applicability and says that a ton of research is yet to be done before it can become a true Master Algorithm, i.e., one that has the capability to solve hard problems.

This chapter is a little more involved as it tries to connect the ideas from the previous eight chapters and introduces a way to combine the various pieces of the puzzle to create a "Master Algorithm". The chapter will also be very interesting for an aspiring ML researcher who is trying to pick a focus area.

 

This is the World on Machine Learning

The last chapter of the book discusses the world in which "Master Algorithm" is all pervasive. The author tries to speculate answers to the following questions :

  • Will humans be replaced by machines ?
  • What do you want the learners of the world to know about you ?
  • How good a model of you can a learner have ?
  • How will ecommerce shopping experience be ?
  • Will there be a rise of the "humanities" disciplines after the automation of most non-human-related tasks ?
  • What will the current online dating sites morph in to ?
  • How will the Amazons, Netflixes, Googles of the world change ?
  • What will be the privacy issues in a society where most transactions and activities involve one algo talking to another algo ?
  • Will future wars be fought by robots ?
  • Will robot-warfare be viable ?
  • Will AI and Master Algorithm take over the world ?

The author ends the book by saying,

Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change.

Takeaway :

This book is a pop-science book on Machine Learning. ML has reached a point where it is not just for geeks anymore; everyone needs to know about it, and everyone needs to at least have a conceptual model of the field, as it has become all-pervasive. Having said that, it would take time to plod through the book if you are a complete newbie to ML. The book is massively appealing to someone who has a cursory knowledge of a few ML techniques and wants a 10,000 ft. view of the entire field's past, present and future. The future, as the title of the book suggests, would be the invention of a master algorithm that unifies methods across all five tribes.

book-cover

If one rips apart a computer and looks at its innards, one sees a coalescence of beautiful ideas. The modern computer is built from electronics, but the ideas behind its design have nothing to do with electronics: a basic design can be built from valves, water pipes, etc. The principles are the essence of what makes computers compute.

This book introduces some of the main ideas of computer science such as Boolean logic, finite-state machines, programming languages, compilers and interpreters, Turing universality, information theory, algorithms and algorithmic complexity, heuristics, uncomputable functions, parallel computing, quantum computing, neural networks, machine learning, and self-organizing systems.

Who might be the target audience of the book ? I guess the book might appeal to someone who has a passing familiarity with the main ideas of computer science and wants to know a bit more without being overwhelmed. If you have ever wanted to explain the key ideas of computing to a numerate high-school kid in such a way that it is illuminating and at the same time makes him/her curious, and you are struggling to explain the ideas in simple words, then you might enjoy this book.

Nuts and Bolts

Recollect any algebraic equation you would have come across in high school. Now replace the variables with logic statements that are either true or false, and replace the arithmetic operations with the relevant Boolean operators, and you get a Boolean algebraic equation. This sort of algebra was invented by George Boole. It was Claude Shannon who showed that one could build electronic circuits that mirror any Boolean expression. The implication of this construction is that any function capable of being described as a precise logical statement can be implemented by an analogous system of switches.

Two principles underlie any type of computer: 1) reducing a task to a set of logical functions and 2) implementing logical functions as a circuit of connected switches. To illustrate these two principles, the author narrates his experience of building a tic-tac-toe game with a set of switches and bulbs. If you were to manually enumerate all the possible combinations of a two-player game, one of the best representations of all the moves is a decision tree. This decision tree helps one choose the right response to whatever the opponent's move is. Traversing the decision tree means evaluating a set of Boolean expressions. Using switches in series or parallel, it is possible to create an automated response to a player's moves. The author briefly describes the circuit he built using 150-odd switches. So, all one needs to construct an automated tic-tac-toe game is switches, wires, bulbs and a way to construct logical gates. There are two crucial elements missing from the design. The first is that the circuit has no concept of events happening over time; therefore the entire sequence of the game must be determined in advance. The second is that the circuit can perform only one function: there is no software in it.

Early computers were made with mechanical components. One can represent AND, OR, NOR, etc. using mechanical contraptions that can then be used to perform calculations. Even in the 1960s, most arithmetic calculators were mechanical.

Building a computer out of any technology requires two components, i.e.

  • switches : this is the steering element, which can combine multiple signals into one signal
  • connectors : carries signal between switches

The author uses hydraulic valves, tinker toys, mechanical contraptions and electrical circuits to show the various ways in which Boolean logic can be implemented. The basic takeaway from this chapter is the principle of functional abstraction. This process of functional abstraction is fundamental to computer design-not the only way to design complicated systems but the most common way. Computers are built up on a hierarchy of such functional abstractions, each one embodied in a building block. Once someone implements 0/1 logic, you build stuff on top of it. The blocks that perform functions are hooked together to implement more complex functions, and these collections of blocks in turn become the new building blocks for the next level.

Universal Building blocks

The author explains the basic mechanism behind translating any logical function into a circuit. One needs to break the logical function down into parts and figure out the types of gates required to translate the various inputs into the appropriate outputs of the logical function. To make this somewhat abstract principle more concrete, the author comes up with a circuit design for a "Majority wins" logical block. He also explains how one could design a circuit for the "Rock-Paper-Scissors" game. The learnings from these simple circuits are then extrapolated to the general computer. Consider an operation like addition or multiplication: you can always represent it as a Boolean logic block. Most computers have logical blocks called arithmetic units that perform this job.

Hence the core idea behind making a computer perform subtraction, addition, multiplication, or whatever computation is to write out the Boolean logic on a piece of paper and then implement the same using logical gates.
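
As a small illustration of the write-out-the-Boolean-logic recipe, here is a sketch of the "Majority wins" block built only from AND and OR gates (the Python rendering is mine; the book builds it from physical switches).

    def AND(a, b):
        return a and b

    def OR(a, b):
        return a or b

    def majority(a, b, c):
        # Output is 1 whenever at least two of the three inputs are 1.
        return OR(OR(AND(a, b), AND(b, c)), AND(a, c))

    for bits in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]:
        print(bits, "->", majority(*bits))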

The most important class of functions are time-varying functions, i.e. those whose output depends on the previous history of inputs. These are handled via a finite-state machine. The author describes the basic idea of a finite-state machine via examples such as a ball-point pen, a combination lock, a tally counter, an odometer, etc. To store the state of a finite-state machine, one needs a device called a register. An n-bit register has n inputs and n outputs, plus an additional timing input that tells the register when to change state. Storing new information is called "writing" the state of the register. When the timing signal tells the register to write a new state, the register changes its state to match the inputs. The outputs of the register always indicate its current state. Registers can be implemented in many ways, one of which is to use a Boolean logic block to steer the state information around in a circle. This type of register is often used in electronic computers, which is why they lose track of what they're doing if their power is interrupted.

A finite-state machine consists of a Boolean logic block connected to a register. The finite-state machine advances its state by writing the output of the Boolean logic block into the register; the logic block then computes the next state, based on the input and the current state. This next state is then written into the register on the next cycle, and the process repeats every cycle. The machine on which I am writing this blog post is a 1.7GHz machine, i.e. it can change its state 1.7 billion times per second. To explain the details of a finite-state machine and its circuit implementation, the author uses familiar examples like traffic lights and a combination lock. These examples are superbly explained, and that's the beauty of this book: simple examples are chosen to illustrate profound ideas.
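
The traffic-light example maps almost directly onto code. Here is a minimal sketch in which a variable plays the role of the register and a lookup table plays the role of the Boolean logic block; the states and the number of ticks are my own illustrative choices.

    NEXT_STATE = {"green": "yellow", "yellow": "red", "red": "green"}

    current_state = "red"            # the contents of the "register"
    for tick in range(6):            # each loop iteration is one timing signal
        print("tick", tick, ":", current_state)
        # The "logic block" computes the next state from the current one,
        # and the result is written back into the register.
        current_state = NEXT_STATE[current_state]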

Finite-state machines are powerful but limited. They cannot recognize many patterns that are common in our world. For instance, it is impossible to build a finite-state machine that will unlock a lock whenever you enter any palindrome. So, we need something else besides logic gates and finite-state machines.

Programming

Boolean logic and the finite-state machine are the building blocks of computer hardware. The programming language is the building block of computer software. There are many programming languages, and if you learn one or two, you can quickly pick up the others. Having said that, writing code that performs something is one thing and writing effective code is a completely different thing; the latter takes many hours of deliberate effort. This is aptly put by the author,

Every computer language has its Shakespeares, and it is a joy to read their code. A well-written computer program possesses style, finesse, even humor-and a clarity that rivals the best prose.

There are many programming languages and each has its own specific syntax. The syntax of these languages is far more convenient to write than machine-level instructions. Once you have written a program, how does the machine know what to do ? There are three main steps :

  1. a finite-state machine can be extended, by adding a storage device called a memory, which will allow the machine to store the definitions of what it’s asked to do
  2. extended machine can follow instructions written in machine language, a simple language that specifies the machine’s operation
  3. machine language can instruct the machine to interpret the programming language

A computer is just a special type of finite-state machine connected to a memory. The computer's memory-in effect, an array of cubbyholes for storing data-is built of registers, like the registers that hold the states of finite-state machines. Each register holds a pattern of bits called a word, which can be read (or written) by the finite-state machine. The number of bits in a word varies from computer to computer. Each register in the memory has a different address, so registers are referred to as locations in memory. The memory contains Boolean logic blocks, which decode the address and select the location for reading or writing. If data is to be written at this memory location, these logic blocks store the new data into the addressed register. If the register is to be read, the logic blocks steer the data from the addressed register to the memory's output, which is connected to the input of the finite-state machine. Memory can contain data, processing instructions and control instructions, all stored in machine language. Here is the basic hierarchy of functional dependence across the various components:

  1. Whatever we need the computer to perform, we write it in a programming language
  2. This is converted into machine language by a compiler, via a predetermined set of subroutines called the operating system
  3. The instructions are stored in memory. These are categorized into control and processing instructions.
  4. Finite-state machines fetch and execute the instructions
  5. The instructions, as well as the data, are represented by bits and stored in the memory
  6. Finite state machines and memory are built based on storage registers and Boolean blocks
  7. Boolean blocks are implemented via switches in series or parallel
  8. Switches control something physical that sends 0 or 1

If you look at these steps, each idea is built upon a level of abstraction. In one sense, that's why anyone can write simple software programs without understanding many details of the computer architecture. However, the closer one gets to the machine language, the more one needs to understand the details of the various abstractions that have been implemented.
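
To make the fetch-and-execute step (item 4 above) concrete, here is a toy sketch of a made-up machine language interpreted by a simple fetch-decode-execute loop; the opcodes, the accumulator design and the little program are all invented for illustration.

    # A tiny made-up program: compute 5 + 7 and store the result at location "x".
    memory = [
        ("LOAD", 0),       # put the constant 0 into the accumulator
        ("ADD", 5),        # add 5
        ("ADD", 7),        # add 7
        ("STORE", "x"),    # store the accumulator into data location "x"
        ("HALT", None),
    ]
    data, accumulator, pc = {}, 0, 0     # data store, accumulator, program counter

    while True:
        opcode, arg = memory[pc]         # fetch and decode the next instruction
        pc += 1
        if opcode == "LOAD":
            accumulator = arg
        elif opcode == "ADD":
            accumulator += arg
        elif opcode == "STORE":
            data[arg] = accumulator
        elif opcode == "HALT":
            break

    print(data)                          # -> {'x': 12}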

How universal are Turing machines ?

The author discusses the idea of universal computer, that was first described in 1937 by Alan Turing. What’s a Turing machine ?

Imagine a mathematician performing calculations on a scroll of paper. Imagine further that the scroll is infinitely long, so that we don’t need to worry about running out of places to write things down. The mathematician will be able to solve any solvable computational problem no matter how many operations are involved, although it may take him an inordinate amount of time.

Turing showed that any calculation that can be performed by a smart mathematician can also be performed by a stupid but meticulous clerk who follows a simple set of rules for reading and writing the information on the scroll. In fact, he showed that the human clerk can be replaced by a finite-state machine. The finite-state machine looks at only one symbol on the scroll at a time,
so the scroll is best thought of as a narrow paper tape, with a single symbol on each line. Today, we call the combination of a finite-state machine with an infinitely long tape a Turing machine.
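
Here is a toy sketch of that combination: a finite-state machine reading and writing one symbol at a time on an unbounded tape. The particular rule table, which increments a binary number by one, is my own illustrative example, not one from the book.

    tape = {0: "1", 1: "0", 2: "1", 3: "1"}    # the binary number 1011, one bit per cell
    head, state = 3, "inc"                     # start at the rightmost bit

    # (state, symbol) -> (symbol to write, head movement, next state)
    rules = {("inc", "1"): ("0", -1, "inc"),   # carrying: turn 1 into 0 and move left
             ("inc", "0"): ("1",  0, "halt"),  # absorb the carry and stop
             ("inc", " "): ("1",  0, "halt")}  # ran off the left end: write a new leading 1

    while state != "halt":
        symbol = tape.get(head, " ")           # blank cells read as " "
        write, move, state = rules[(state, symbol)]
        tape[head] = write
        head += move

    print("".join(tape[i] for i in sorted(tape)))   # 1011 + 1 -> 1100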

The author also gives a few examples of noncomputable problems, such as the "halting problem". He also touches briefly upon quantum computing and explains the core idea using the water molecule:

When two hydrogen atoms bind to an oxygen atom to form a water molecule, these atoms somehow "compute" that the angle between the two bonds should be 107 degrees. It is possible to approximately calculate this angle from quantum mechanical principles using a digital computer, but it takes a long time, and the more accurate the calculation the longer it takes. Yet every molecule in a glass of water is able to perform this calculation almost instantly. How can a single molecule be so much faster than a digital computer?

The reason it takes the computer so long to calculate this quantum mechanical problem is that the computer would have to take into account an infinite number of possible configurations of the water molecule to produce an exact answer. The calculation must allow for the fact that the atoms comprising the molecule can be in all configurations at once. This is why the computer can only approximate the answer in a finite amount of time.

One way of explaining how the water molecule can make the same calculation is to imagine it trying out every possible configuration simultaneously-in other words, using parallel processing. Could we harness this simultaneous computing capability of quantum mechanical objects to produce a more powerful computer? Nobody knows for sure.

Algorithms and Heuristics

The author uses simple examples to discuss various aspects of algorithms, such as designing an algorithm and computing its running time. For many problems where precise algorithms are not available, the next best option is to use heuristics. Designing heuristics is akin to an art. Most real-life problems require one to come up with a healthy mix of algorithmic and heuristic solutions. IBM's Deep Blue is an amazing example showing that intermixing algorithms and heuristics can beat one of the best human minds. Indeed, there will be a few cognitive tasks that remain out of the computer's reach, and humans will have to focus on skills that are inherently human. A book-length treatment is given to this topic by Geoff Colvin in his book Humans Are Underrated.

Memory : Information and Secret codes

Computers do not have infinite memory, so there needs to be a way to measure the information stored in memory. An n-bit memory can store n bits of data, but we need to know how many bits are required to store a certain form of input. One can think of various ways of doing so. One could use the "representative" definition, i.e. think about how each character of the input is represented on a computer and then assess the total number of bits required to represent the input. Let's say this blog post has 25,000 characters and each character takes about 8 bits on my computer; then the blog post takes up 0.2 million bits. 8 bits per character could be a stretch: maybe all the characters in this post require only 6 bits per character, in which case only 0.15 million bits would be required to represent the post. The problem with this way of quantifying information is that it is "representation" dependent. The ideal measure would be the minimum number of bits needed to represent the information. Hence the key question here is, "How much can you compress a given text without losing information?".

Let's say one wants to store the text of "War and Peace". The information size obtained by multiplying 8 bits by the number of characters in the novel gives an upper bound. By considering various forms of compression, such as 6-bit encoding, taking advantage of regularities in the data, and taking advantage of the grammar of the language, one can reduce the information size of the novel. In the end, compression that uses the best available statistical methods would probably reach an average representation size of fewer than 2 bits per character-about 25 percent of the standard 8-bit character representation.
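
The representation-dependence of this measure is easy to demonstrate. The sketch below compares the naive 8-bits-per-character size of a (deliberately repetitive, made-up) text with its size after a general-purpose compressor has squeezed out the regularities.

    import zlib

    text = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

    raw_bits = len(text) * 8
    compressed_bits = len(zlib.compress(text, level=9)) * 8

    print("raw size:", raw_bits, "bits")
    print("compressed size:", compressed_bits, "bits")
    print("bits per character after compression:", round(compressed_bits / len(text), 2))
    # Highly regular text compresses to a tiny fraction of its raw representation;
    # truly random text would hardly compress at all.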

If the minimum number of bits required to represent an image is taken as a measure of the amount of information in the image, then an image that is easy to compress will have less information. A picture of a face, for example, will have less information than a picture of a pile of pebbles on the beach, because the adjacent pixels in the facial image are more likely to be similar. The pebbles require more information to be communicated and stored, even though a human observer might find the picture of the face much more informative. By this measure, the picture containing the most information would be a picture of completely random pixels, like the static on a damaged television set. If the dots in the image have no correlation to their neighbors, there is no regularity to compress. So, pictures that are totally random require a lot of bits and hence contain a lot of information. This goes against our everyday notion of information: in common parlance, a picture with random dots should have less information than a picture with some specific dot pattern. Hence it is important for computers to store meaningful information, and indeed that's how many image and sound compression algos work: they discard meaningless information. Another level of generalization of this idea is to consider a program that can generate the data being stored. This leads us to another measure of information:

The amount of information in a pattern of bits is equal to the length of the smallest computer program capable of generating those bits.

This definition of information holds whether the pattern of bits ultimately represents a picture, a sound, a text, a number, or anything else.

The second part of this chapter talks about public-key private key encryption and error correction mechanisms. The author strips off all the math and explains it in plain simple English.

Speed: Parallel Computers

"Parallel computing" is a word that is tossed around at many places. What is the basic problem with a normal computer that parallel computing aims to solve? If you look at the basic design of a computer, it has always been that processing and memory were considered two separate components, and all the effort was directed towards increasing the processing speed. If you compare the modern silicon chip based computer to let’s say the vintage room filled computer, the basic two-part design has remained the same, i.e. processor connected to memory design. This has come to be known as sequential computer and has given the need for parallel computing. To work any faster, today’s computers need to do more than one operation at once. We can accomplish this by breaking up the computer memory into lots of little memories and giving each its own processor. Such a machine is called a parallel computer. Parallel computers are practical because of the low cost and small size of microprocessors. We can build a parallel computer by hooking together dozens, hundreds, or even thousands of these smaller processors. The fastest computers in the world are massively parallel computers, which use thousands or even tens of thousands of processors.

In the early days of parallel computing, many people were skeptical of it, the main reason being Amdahl's law: if some fraction of a task (say 10%) has to be done sequentially, then no matter how many parallel processors you use, the total completion time does not go down proportionally as you add more and more processors. Soon it was realized that most tasks have a negligible sequential component, and that by smart design one could parallelize many tasks.
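
In formula form (a standard statement of Amdahl's law, not a quote from the book): if a fraction s of the work is inherently sequential, the best possible speedup on N processors is

    \text{speedup}(N) = \frac{1}{\,s + \frac{1 - s}{N}\,} \le \frac{1}{s}

so with s = 0.1 the speedup can never exceed 10x no matter how many processors are added, whereas a negligible s lets the speedup grow almost linearly with N.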

Highly parallel computers are now fairly common. They are used mostly in very large numerical calculations (like the weather simulation) or in large database calculations, such as extracting marketing data from credit card transactions. Since parallel computers are built of the same parts as personal computers, they are likely to become less expensive and more common with time. One of the most interesting parallel computers today is the one that is emerging almost by accident from the networking of sequential machines. The worldwide network of computers called the Internet is still used primarily as a communications system for people. The computers act mostly as a medium-storing and delivering information (like electronic mail) that is meaningful only to humans. Already standards are beginning to emerge that allow these computers to exchange programs as well as data. The computers on the Internet, working together, have a potential computational capability that far surpasses any individual computer that has ever been constructed.

Computers that learn and adapt

The author gives some basic ideas on which adaptive learning has been explored via algorithms. Using examples such as neural networks and the perceptron, the author manages to explain the key ideas of machine learning.

Beyond Engineering

The brain cannot be analyzed via the usual "divide and conquer" approach that is used to understand the "sequential computer". As long as the function of each part is carefully specified and implemented, and as long as the interactions between the parts are controlled and predictable, this system of "divide and conquer" works very well, but an evolved object like the brain does not necessarily have this kind of hierarchical structure. The brain is much more complicated than a computer, yet it is much less prone to catastrophic failure. The contrast in reliability between the brain and the computer illustrates the difference between the products of evolution and those of engineering. A single error in a computer's program can cause it to crash, but the brain is usually able to tolerate bad ideas and incorrect information and even malfunctioning components. Individual neurons in the brain are constantly dying and are never replaced; unless the damage is severe, the brain manages to adapt and compensate for these failures. Humans rarely crash.

So, how does one go about designing something differently from the usual "engineering" approach? The author illustrates this via a "sorting numbers" example, in which one proceeds as follows:

  • Generate a "population" of random programs
  • Test the population to find which programs are the most successful.
  • Assign a fitness score to each program based on how successfully they sort the numbers
  • Create new populations descended from the high-scoring programs. One can think of many ways here: let only the fittest survive, or "breed" new programs by pairing survivors from the previous generation
  • When the new generation of programs is produced, it is again subjected to the same testing and selection procedure, so that once again the fittest programs survive and reproduce. A parallel computer will produce a new generation every few seconds, so the selection and variation processes can feasibly be repeated many thousands of times. With each generation, the average fitness of the population tends to increase-that is, the programs get better and better at sorting. After a few thousand generations, the programs will sort perfectly.

The author confesses that the output from the above experiment works, but that it is difficult to "understand" why it works.

One of the interesting things about the sorting programs that evolved in my experiment is that I do not understand how they work. I have carefully examined their instruction sequences, but I do not understand them: I have no simpler explanation of how the programs work than the instruction sequences themselves. It may be that the programs are not understandable-that there is no way to break the operation of the program into a hierarchy of understandable parts. If
this is true-if evolution can produce something as simple as a sorting program which is fundamentally incomprehensible-it does not bode well for our prospects of ever understanding the human brain.

The chapter ends with a discussion about building a thinking machine.

Takeaway :

The book explains the main ideas of "computer science" in simple words. Subsequently, it discusses many interesting areas that are currently being explored by researchers and practitioners. Anyone who is curious about questions like "How do computers work ?" or "What are the limitations of today's computer hardware and software ?", and wants to get some idea of the answers without being overwhelmed by them, will find this book interesting.


I have enjoyed reading Nicholas Carr's previous book, "The Shallows", which discusses at length the various ways in which the Internet impacts our brains. In this book, the author takes "automation" as his subject and explores its human consequences. In this blog post, I will summarize the main points of the various chapters.

Alert for Operators

This chapter opens with a terse notice issued by the Federal Aviation Administration in January 2013. The note was addressed to all US airlines and commercial air carriers, and said,

The FAA encourages operators to promote manual flight operations when appropriate.

Airline cockpits in today's world are totally automated, reducing the role of the pilot to that of a baby-sitter. Why does the FAA want operators to do things manually? What was lost in the process of completely automating flight operations? These are some of the teaser questions that motivate a reader to start reading the book.

Passengers

The author narrates his experience of transitioning from a manual car to an automatic one. After the initial euphoria subsided, he realized that driving had become a boring task, as he was reduced to a passenger in his own car.

Most of the knowledge that we gather over our lifetime can be categorized as either tacit or explicit. It is easy to argue that explicit knowledge can be programmed and automated. Tacit knowledge, on the other hand, one might think, is hard to automate. The author argues that this is no longer the case. Google’s self-driving car, which can travel on busy streets without any human aid, is a case in point. The distinction between tacit and explicit knowledge is being blurred.

There are many disciplines where automation is changing the workflow. The flip side of “automation bringing in efficiency and cost savings” is that it is marginalizing the role of humans. Why do we allow automation in the first place? There seems to be an increasing trend towards people wanting things to be automated. The author traces this behavior to a study by Mihaly Csikszentmihalyi and Judith LeFevre, who stumbled upon a strange pattern in a social science experiment involving hundreds of blue-collar and white-collar workers. They found that people are happier and feel more fulfilled by what they are doing while they are at work than during their leisure hours; in their free time, most of the workers tend to feel bored and anxious. And yet they didn’t like being at work. When they were on the job, they expressed a strong desire to be off the job, and when they were off the job, the last thing they wanted was to go back to work. This is a paradox: people have more positive feelings at work than in leisure, yet say they ‘wish to be doing something else’ when they are at work, not when they are at leisure. Psychologists have given a poetic name to this behavior: “miswanting”.

We are inclined to desire things we don’t like and to like things we don’t desire

This type of behavior is one of the reasons, the author argues, that people welcome automation. By reducing the burden of their work, they think they can indulge in doing better things at work. On the contrary, automation, when it goes the whole hog, removes complexity from jobs and diminishes the challenge they provide to the person doing them. Automation is not intrinsically bad. The problem is that we don’t know when to say enough. The deck is stacked, economically and emotionally, in automation’s favor. We overestimate automation’s benefits, and that, coupled with our bias for leisure over work, makes it easy for companies to adopt automation rapidly.

The Robot at the gate

This chapter traces the history of automation on the machine floor. If you think robots are still years away from finding a place on the factory floor, just watch this CBS 60 Minutes video:

Link : Are robots hurting job growth?

The author more or less summarizes the content of the above video in this chapter, adding some additional references. The basic theme running through the chapter is that automation is growing at a rapid pace and displacing people from their jobs, and job creation is not happening at the same pace at which jobs are being automated.

On Autopilot

This chapter is about the way automation has affected the airline industry. It traces the history of automation in the context of airlines and discusses various issues that have become part of today’s reality. It was on June 18, 1914 that the story of flight automation began. Elmer Sperry and Lawrence Sperry (the father-son duo) designed the “gyroscopic stabilizer apparatus”, an instrument that could sense, with remarkable precision, a plane’s orientation along its three axes of rotation. An aerial display of the capabilities of this instrument captured everyone’s imagination.

It took another twenty years before the US military could refine the idea into a gyroscopic autopilot; the first commercial flight using it flew three hours without any human aid. The introduction of the gyroscopic autopilot set the stage for a momentous expansion of aviation’s role in warfare and transport. Over the next 40 years, several technological advances led to the launch of the A320, a 150-seat plane and the first commercial aircraft that was truly computerized. Dials and gauges gave way to glass screens, and the pilot’s role changed forever. The cockpit became a glass cockpit, showing various aspects of the running plane. The computer was in charge of everything, and the human’s role was significantly marginalized. The essence of the pilot’s job consisted of sending digital inputs to computers and monitoring the computers’ digital outputs, while the computers governed the plane’s moving parts and chose its course. The computer had the final say in times of extreme maneuvers. The commercial pilot lost the aura of romance and adventure.

What was the result of this automation? The first outcome of severing the tactile link between pilot and plane was that pilots’ motor skills became rusty. Many studies have shown that automation has led to a gradual erosion of pilots’ expertise: it has dulled their reflexes, diminished their attentiveness, degraded their situational awareness and weakened their hand-flying skills. If you pause for a while, it seems logical that increased automation would result in all these things. But then what is the role of the pilot? Should there be human pilots at all? If so, how should they be trained, given that the computer has completely taken over?

The Degeneration Effect

Retrieving a concept from memory changes the memory associated with that concept. This observation has led to many interesting study hacks, and the umbrella term used to capture this effect is the “generation effect”. The act of actively retrieving information enhances the way the information is stored in our brains. Generating ideas before you encounter something new, or before you reread something, leads to better retention and comprehension. The author says that heavy use of automated tools actually reverses the “generation effect”, and terms this the “degeneration effect”.

When we tackle a task with the aid of computers, we run the risk of two things:

  • Automation complacency: our tendency to trust the system so much that we drift away and lose attention
  • Automation bias: our tendency to ignore things when the system goes wrong. We somehow believe that the system is still working, and working well.

The author is of the opinion that these two aspects of automation will rob us of deeply learning a task. Diminished intensity of thinking leads to poor encoding of information in our brains. I don’t fully agree with the content of this chapter. Let’s say there is a tool that generates a boxplot automatically. Does it rob the analyst of deep learning? Not really. The painful work of sorting the data, picking the first, second and third quartile values and drawing the outliers is automated by visualization software. This leaves the data analyst with more time to actually understand the data. The same is the case with an IDE. Some might think that an IDE dulls one’s brain; that argument is also stretched. Yes, an IDE with code completion makes one type less, makes workspace management easy, and so on. There is indeed some learning in doing all these things manually, but for many, an IDE is simply a way to write code and get things done. For getting things done, how does it matter whether I use a convenient IDE, something as involved as Emacs, or the command line? Does an IDE make me a dull boy? I don’t think so. Look at the rise of IPython in the Python community or RStudio in the R community. By relieving the analyst of many programmer chores, these tools give far more time to actually analyze data and build models. Somehow I find it difficult to buy the argument that automation goes against deep learning. Automating the painful tasks is good and should always be welcome.
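To make the boxplot point concrete, here is a minimal sketch of my own (not from the book) contrasting the manual quartile and outlier bookkeeping with the single call a visualization library provides:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=500)

# The "manual" steps that visualization software automates:
q1, q3 = np.percentile(data, [25, 75])                 # first and third quartiles
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(f"Q1={q1:.2f}, Q3={q3:.2f}, {outliers.size} outliers")

# The automated equivalent: one call, and the analyst moves on to interpreting the data.
plt.boxplot(data)
plt.show()
```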

Interlude, with Dancing Mice

The author talks about the Yerkes-Dodson law. The law was formulated by two social scientists after a ton of experiments on mice. The experimental setup had mice choosing one of two paths, one of which came with an electric shock. When the mice were given a barely noticeable shock in one of the paths, they barely noticed the difference between the two paths; the same was true with high-intensity shocks. Only when the shock level was moderate were the mice able to differentiate between the two paths. Since the publication of this research in 1908, the law has been found to hold in many fields. The author uses it to explain the effects of automation. When automation reduces the workload too much, humans tend to drift away from the work and lose attention; this is like moving to the left of the Yerkes-Dodson curve, where under-stimulation reduces performance. Similarly, if automation increases the number of things one has to take care of (especially in times of crisis), humans tend to buckle under the load; this is like moving to the right of the curve. The ideal is somewhere in between: automation should reduce our burden just enough that we are neither under-stimulated nor over-stimulated.

White-Collar Computer

The author talks about specific domains and discusses the effects of automation in each. Medical records automation was heralded as an event that would change the face of the US health care industry. To date there is no evidence that this decade-long, taxpayer-funded project has yielded a significant reduction in US healthcare costs. Most medical record systems operate as silos, and the only people who will benefit, at least in the near future, are the IT personnel who write the code for these disparate systems to talk to each other. The frequency of medical tests per patient was supposed to have gone down after automation; empirical studies show, however, that doctors with easily accessible medical histories are recommending more tests than in the pre-automation era.

In this age of big data, machine learning and predictive models are pervading every domain, and it is very likely that algorithms will take over many of the tasks that were previously considered complex and required human personnel. In finance, traders and dealers are being replaced by computers. Execution algorithms have taken over the sell side of Wall Street and of many other markets; these algorithms have turned “sell side traders” into “sell side computer operators”. Wall Street and many other financial firms love this, as they have to deal less and less with the human side of things. The automation is also visible on the buy side, where fund managers have tools to send orders directly to the exchange using a broker’s infrastructure. Most quant shops and HFT players largely employ computer professionals who love automated systems. The machinist is turning into a machine operator. The author talks about the legal profession too, where pattern matching and statistical learning techniques are reducing jobs of several hundred man-hours to merely hours. Given the way technology is going, the title of the chapter is apt: we will increasingly see white-collar jobs being done by a computer.

World and Screen

Location aids have kept improving as technology evolved. Simple maps, trail markers, star maps, nautical charts, compasses, sextants, lighthouses, buoys and signs posted on highways have made our lives easier by helping us navigate the external world. These have been “aids” for us in navigating spaces. One might think that GPS is yet another step in the evolution of location aids. The author argues that GPS is a fundamentally different type of aid; in fact, he desists from calling it an aid at all, since GPS gives you step-by-step instructions to move from origin to destination and you tend to follow them rather blindly. Unlike with the older, analog aids, there is no “figuring out” to be done; the whole process of going from point A to point B is mechanistic. OK, so what? Is it not fine to reach the destination and let the details of the travel be taken care of by software? Paper maps do not shepherd us from one place to another; they teach us how to think about space.

The author cites many research articles on memory and says that the more we use GPS, the less we build cognitive maps, and hence it affects our memory-building skills. There are specific areas in our brains that grow if we put effort into wayfinding. The more automatic wayfinding becomes, the less chance there is of growth in “place cells” and “grid cells”. Does it matter if we do not have enough of those cells?

In a 2013 article in Nature Neuroscience, Edvard Moser and his colleague György Buzsáki provided extensive experimental evidence that “the neuronal mechanisms that evolved to define the spatial relationship among landmarks can also serve to embody associations among objects, events and other types of factual information.” Out of such associations we weave the memories of our lives. It may well be that the brain’s navigational sense — its ancient, intricate way of plotting and recording movement through space — is the evolutionary font of all memory.

The author presents a scary scenario in which he compares our loss of spatial sense with the onset of Alzheimer’s disease. If we keep using automated tools for wayfinding, will we lose our spatial sense completely and be vulnerable to Alzheimer’s? The author cites a few experts who are of the view that in the years to come, dementia will strike people earlier and earlier.

Another area where automation has started showing its harmful effects is design and architecture. The author’s target is CAD, the software that has indeed changed the way architects and engineers design buildings. CAD in its initial versions was meant to translate a design into execution plans; the physical act of drawing has been replaced by software clicks, and the whole task of designing has changed. Architects used to go through many years of apprenticeship before they could build deep knowledge of the various design elements. Over many versions, CAD has come to embody so many features that it now enables a young architect to put in specifications, play with the software-generated design and come up with all kinds of funky designs. The young architect might think he is developing an innovative design for a mall or a building; in many cases, however, he might just be creating the more standardized design that the software is built to churn out. Somehow the divorce of physical action from the design process seems to have had a perverse effect on the design community as a whole. The author cites many examples where building architecture is becoming more standardized.

Automation for the People

The author talks about the need for a more humane approach to designing software. One can infer from the Yerkes-Dodson law that automation should neither under-stimulate nor over-stimulate the human operator. This means that sometimes it might be good for the software to hand the reins back to the human in charge so that his skills do not become rusty. This is easier said than done, as companies the world over do not want to compromise on efficiency and profits. Learning always entails some inefficiency, and hence firms might not be willing to invest time and effort in developing humane interfaces. This tension can be clearly seen in the way passenger aircraft are designed. Airbus takes a completely technology-oriented approach in which the pilots are meant to “baby-sit” in the cockpit; Boeing takes a more human-centered approach in its cockpit design, where pseudo-tactile feedback is given to the pilots so that they do not run the risk of tuning out because of automation complacency or automation bias. Michael Lewis’s book “Flash Boys” narrates the story of IEX, an exchange that is trying to bring a humane approach to trading by slowing down execution times. The chapter ends with the author taking a not-so-rosy view of the future: even though human-computer interfaces need to be more like human-human interaction, it is unlikely that such interfaces will be built. The force of automation in many fields is marginalizing the role of the human and making it a passive one.

Your Inner Drone

In any activity where computers replace humans, there are moral aspects of the work that need to be taken care of. This would not be an issue if automation had been introduced merely to replace mundane tasks. In today’s world, however, computers are doing the entire gamut of activities in fields as diverse as the military, design and architecture, commercial aviation, agriculture, finance and health care.

The author talks about the use of LARs (Lethal Autonomous Robots) in wars and says that a LAR takes decisions based entirely on past patterns and real-time data. A LAR is different from a drone: the latter is automatic only to a certain extent, and humans have a say in the location and the time of attack. When a LAR sees a civilian, how does it calculate the probability of whether to bomb or not? After all, the underlying software is written by humans, so one might assume that humans can write the logic for morality too. Typically, software of this nature follows a top-down approach, a bottom-up approach, or a hybrid; it is abundantly clear that programming LARs has to be a hybrid approach. But will a hybrid approach work? To what extent can such issues be programmed? To what extent can LARs be allowed to take decisions? These are moral dilemmas, and with the ongoing pace of automation they will become more and more important to think about. There is a possibility that the increasing use of LARs will change the way we think about wars. Since LARs will decrease the number of human casualties in war, the threshold for launching into a war with a rival nation may go down. “Change the tool, change the war” is a possible reality.

That Love that lays the Swale in Rows

The last chapter of the book explores the human relationship with tools. Tools are extensions of human thought; we show our creativity in designing tools and, more importantly, in using them. It is this nature that is misunderstood by many, for various reasons. Somehow the prevalent thinking is that tools are slaves and we are the masters: the more we let our tools/slaves work, the more time we get for the things we like. This line of thinking pushes us to automate all kinds of tasks, assuming it will free our time. Once the automation exceeds the peak of the Yerkes-Dodson curve, we soon realize that the computer is the master and we are the slave. We are reduced to passive observers, becoming watchmen and spectators rather than players. So what? Isn’t the outcome all that matters? The author argues that tools, and the way we create and use them, define us. If we actively use tools to learn something deep, immerse ourselves with them in an activity that is productive as well as exhausting, and use them to fail as well as to learn, we get to know ourselves and the world better. Thinking of “tools as an extension of us” is a better frame of mind than “master-slave” thinking, and it will make us better designers and better users of automation tools.

Takeaway:

The book discusses the consequences of automation in various fields: airline cockpits, medical record systems, factory workflows, trading in the financial markets, automobile maneuvering and military warfare are some of the areas covered. The effects of automation in these fields have fundamentally changed the way people work. From a firm’s perspective, automation is a no-brainer, as it brings in efficiency and reduces costs. From the perspective of the person whose job is getting automated, however, the consequences are dreadful. If you think that your job has a decent chance of getting automated, you might want to read this book and ponder the various points discussed by the author.


For those of us who were born before 1985, it is likely that we have seen two worlds: one that wasn’t dependent on the net, and another in which our lives are dominated/controlled by the web and social media. The author says that, given this vantage point, we have a unique perspective on how things have changed. It is impossible to imagine a life without print, yet before Gutenberg’s printing press, invented around the 1450s, access to knowledge was primarily through the oral tradition. Similarly, maybe a decade or two from now, the next generation will be hard-pressed to imagine a life without connectivity. There is one BIG difference between Gutenberg’s revolution and the Internet: the pace. Even though the printing press was invented around the 1450s, it was not until the 19th century that enough people were literate for the written word to influence society. In contrast, we have seen both the offline and the online world within a single lifetime.

We are in a sense a straddle generation, with one foot in the digital pond and the other on the shore, and we are experiencing a strange suffering as we acclimatize. In a single generation, we are the only people in history to experience such a massive level of change. The author is not critical of technology; after all, technology is neither good nor bad, it is neutral. First, something about the word in the title: Absence. It is used as a catch-all term for any activity that does not involve the Internet, mobiles, tablets, social media and the like. Given this context, the author structures the book in two parts. The first part explores certain aspects of our behavior that have dramatically changed, with consequences we can see all around us. The second part is part reflection, part experimentation by the author in remaining disconnected in this hyper-connected world. In this post, I will summarize a few sections of the book.

Kids these days

Increasingly, kids are living in a world where the daydreaming silences of life are filled by social media notifications, and burning solitudes are extinguished by constant yapping on social networks and phones and by playing video games. That said, what is the role of a parent? The author argues that we have a responsibility to provide children with enough offline time.

How often have you seen a teenager staring out of a window and doing nothing but being silent? In all likelihood the parents would think that there is something wrong with the kid: he had a fight with a sibling, something in class upset him/her, someone taunted him, and so on. If the kid is typing on a mobile, talking over the phone, texting or playing a video game, the standard parental reaction is “kids these days”, and they leave it at that. Instead of actively creating an atmosphere where downtime becomes a recurring event, parents are shoving technology onto kids to escape the responsibility. How else can one justify this product (the iPotty)?


One digital critic says, “It not only reinforces unhealthy overuse of digital media, it’s aimed at toddlers. We should NOT be giving them the message that you shouldn’t even take your eyes off a screen long enough to pee.”

Many research studies have concluded that teenagers are more at ease with their technologies than with one another. The author argues that parents should be aware of subtle cues and create engineered absences for their children, so that they develop empathy for others via real-world interactions rather than through avatars in the digital world.

Montaigne once wrote, “We must reserve a back shop, all our own, entirely free, in which to establish our real liberty and our principal retreat and solitude.” But where will tomorrow’s children set up such a shop, when the world seems to conspire against the absentee soul?

Confession

The author mentions the Amanda Todd incident, in which a teenager posted a YouTube video about her online bully and committed suicide. Online bullying is a widespread phenomenon, but social technologies offer only a weak solution to the problem: flag this as inappropriate / block the user. Though crowd-sourced moderation may look like a sensible solution, in practice it is not; the moderation teams at most social media firms cannot handle the volume of flagging requests. The author mentions big data tools being developed that take in all the flagging-request streams and then decide the appropriate action. The reduction of our personal lives to mere data does run the risk of collapsing things into a Big Brother scenario, with algorithms scouring the Internet for “unfriendly” behavior and dishing out “correction” in one form or another. Do we want algorithms to abstract, monitor and quantify us? Well, if online bullying can be reduced by digital tools, so be it, even though it smells like a digital band-aid for problems of the digital world. The author is, however, more concerned with the “broadcast” culture that has suddenly been thrust upon us.

When we make our confessions online, we abandon the powerful workshop of the lone mind, where we puzzle through the mysteries of our own existence without reference to the demands of an often ruthless public.

Our ideas wilt when exposed to scrutiny too early—and that includes our ideas about ourselves. But we almost never remember that. I know that in my own life, and in the lives of my friends, it often seems natural, now, to reach for a broadcasting tool when anything momentous wells up. Would the experience not be real until you had shared it, confessed your “status”?

The idea that technology must always be a way of opening up the world to us, of making our lives richer and never poorer, is a catastrophic one. But the most insidious aspect of this trap is the way online technologies encourage confession while simultaneously alienating the confessor.

Authenticity

The author wonders about the “distance” that any new technology creates between a person and his/her “direct experience”. Maps made the information obtained by actually exploring a place seem less useful and more time-consuming to gather, since a convenient abstract version was at hand. The mechanical clock regimented leisurely time and eventually had more control over you than your body’s own inclinations. So it is with MOOCs, which take us away from directly experiencing a teacher’s lesson in flesh and blood. These are changes in our society, and possibly irrevocable ones. The fact that the selfie stick made it onto Time magazine’s list of the 25 best inventions of 2014 says that there is some part of us that wants to share the moment more than actually experience it. Daniel Kahneman, in one of his interviews, talks about the riddle of the experiencing self vs. the remembering self.

Suppose you go on a vacation and, at the end, you get an amnesia drug. Of course, all your photographs are also destroyed. Would you take the same trip again? Or would you choose one that’s less challenging? Some people say they wouldn’t even bother to go on the vacation. In other words, they prefer to forsake the pleasure, which, of course, would remain completely unaffected by its being erased afterwards. So they are clearly not doing it for the experience; they are doing it entirely for the memory of it.

Almost every person in today’s world, I guess, is dependent on digital technologies. Does that take us away from having an authentic, more direct experience? Suppose your cell phone and your Internet connection were taken away over the weekend: could you lead a relaxed, refreshing weekend? If the very thought of such a temporary two-day absence gives you discomfort, then one should probably revisit the very idea of what it means to have an authentic experience. Can messaging a group on WhatsApp count as an authentic “being with a friend” experience? All our screen time, our digital indulgence, may well be wreaking havoc on our conception of the authentic. Paradoxically, it’s the impulse to hold more of the world in our arms that leaves us holding more of reality at arm’s length.

The author mentions the Carrington Event:

On September 1, 1859, a storm on the surface of our usually benevolent sun released an enormous megaflare, a particle stream that hurtled our way at four million miles per hour. The Carrington Event (named for Richard Carrington, who saw the flare first) cast green and copper curtains of aurora borealis as far south as Cuba. By one report, the aurorae lit up so brightly in the Rocky Mountains that miners were woken from their sleep and, at one a.m., believed it was morning. The effect would be gorgeous, to be sure. But this single whip from the sun had devastating effects on the planet’s fledgling electrical systems. Some telegraph stations burst into flame.

and says that such an event, according to experts, has a 12% probability of occurring in the next decade and a 95% probability in the next two centuries. What will happen when such an event occurs?

Breaking Away

This chapter narrates the author’s initial efforts to seek the absence. In a way, the phrase “seeking the absence” is itself ironic: if we don’t seek anything and simply stay still, aren’t we in absence already? Not really, if our mind is in a hyperactive state.

One can think of many activities that demand a significant time investment, or, to be precise, an uninterrupted time investment. In my life there are a few such activities: reading a book, understanding a concept in math/stat, writing a program, playing an instrument. One of the first difficulties in pursuing these tasks is: how does one go about managing distractions, be they digital or analog? As for the digital ones, constantly checking email/WhatsApp/Twitter makes it tough to concentrate on a task that demands full immersion.

Why does our brain want to check emails and messages so often? What makes these tools addictive? It turns out the answer was given way back in 1937 by the psychologist B.F. Skinner, who described the behavior as “operant conditioning”. Studies show that constant, reliable rewards do not produce the most dogged behavior; rather, it is sporadic and random rewards that keep us hooked. Animals, including humans, become obsessed with reward systems that only occasionally and randomly give up the goods. We continue the conditioned behavior for longer when the reward is taken away because surely, surely, the sugar cube is coming up next time. So that one meaningful email once in a while keeps us hooked on frequent email checking. Is trading in the financial markets in search of alpha an outcome of operant conditioning? The more I look at traders who keep trading despite poor performance, the more certain I feel it is: the occasional reward keeps them hooked on trading despite subpar performance.

Try reading a book and catch yourself: how many times do you start thinking about what else you could be doing or reading right now? Have we lost the ability to remain attentive to a given book or task without constantly multi-tasking? (By the way, research has shown beyond doubt that there is no such thing as multitasking; all we do is mini-tasking.) It definitely happens to me quite a number of times. When I am going through something that is really tough to understand in an ebook (mostly, these days, the books I end up reading are ebooks, as hardbound editions are beyond my budget), I hit ALT+TAB, the attention-killing combination on a keyboard that takes me from a situation where I have to actively focus in order to understand something to a Chrome/Firefox tab where I can passively consume content, indulge in hyperlink hopping, waste time and really gain nothing. Over the years, I have figured out a few hacks that alert me to this compulsive ALT+TAB behavior. I cannot say I have slain the ALT+TAB dragon for good, but at least I have managed to control it.

The author narrates his experience of trying to read “War and Peace”, a thousand-page book, amidst his hyper-connected world. He fails to get far in his initial attempts, as he finds himself indulging the automatic desires of the brain.

I’ve realized now that the subject of my distraction is far more likely to be something I need to look at than something I need to do. There have always been activities—dishes, gardening, sex, shopping—that derail whatever purpose we’ve assigned to ourselves on a given day. What’s different now is the addition of so much content that we passively consume.

Seeking help from Peter Bregman (of “18 Minutes”), he allots himself 100 pages of “War and Peace” each day with ONLY three email check-ins a day. He also explores the idea of using software to help him control distractions (Dr. Sidney D’Mello, a Notre Dame professor, is creating software that tracks a person’s attention in real time and sounds an alarm). In the end, what helps the author complete “War and Peace” is his awareness of the lack of absence, which makes him find periods of absence in which he can immerse himself. I like the way he describes this:

As I wore a deeper groove into the cushions of my sofa, so the book I was holding wore a groove into my (equally soft) mind.

There’s a religious certainty required in order to devote yourself to one thing while cutting off the rest of the world—and that I am short on. So much of our work is an act of faith, in the end. We don’t know that the in-box is emergency-free, we don’t know that the work we’re doing is the work we ought to be doing. But we can’t move forward in a sane way without having some faith in the moment we’ve committed to. “You need to decide that things don’t matter as much as you might think they matter,”

Does real thinking require retreat? The author thinks so and cites the example of John Milton, who took a decade off to read, read and read, at a time when his peers were DOING and ACCOMPLISHING things. Did he waste his time? After this retreat Milton wrote “Paradise Lost”, a totemic feat of concentration. This example may be a little too extreme for a normal person to take up, but I think we can actively seek mini-retreats in a day/week/month/year. By becoming oblivious to thoughts like “what are others doing?”, “what else should I be doing right now?” and “what could that new notification on my mobile/desktop be about?”, I guess we will manage to steal mini-retreats in our daily lives.

Memory

The word memory evokes, at least for many of us, a single stationary cabinet in which we file everything and from which retrieval is at best partial (for the trivia that goes on in our lives). This popular notion has been thoroughly invalidated by experiments in neuroscience. Brenda Milner’s study of patient H.M. is considered a landmark event in neuroscience, as it established that our memory is not a single stationary cabinet in which we file everything: motor memory and declarative memory reside in different parts of the brain. Many subsequent experiments have established that human memory is a dynamic series of systems, with information constantly moving between them, and changing.

Why does the author talk about memory in this book? The main reason is that we are relying more and more on Google, Wikipedia and other digital tools for storing and looking up information; we have outsourced our “transactive memory” to these services. In this context, the author mentions Timehop, a service that reminds you what you were doing a year ago by aggregating content from your presence on Facebook, Twitter and blogs. You might think it is cool that Timehop keeps track of your life, but there is a subtle aspect to such services: we are offloading our memory to digital devices. Isn’t it good that I don’t have to remember all the trivia of life and at the same time have it at the click of a button? Why do we need memory at all when everything we need is a click away? There is no harm in relying on these tools. The issue, however, is that you cannot equate effective recall of information with “human memory”. Memorization is the act of making something “a property of yourself”, and this is true in both senses: the memorized content is owned by the memorizer and also becomes a component of that person’s makeup. If you have memorized something, the next time you access that memory you form a new memory of it; accessing memory changes memory. This is fundamentally different from “externalized memory”. In a world where we increasingly find whatever we need online, “having a good memory” or “memorization” might seem useless skills. This chapter argues that they aren’t.

How to absent oneself?

This chapter is about the author’s experience of staying away from the digital world for one complete month, which he fondly calls “Analog August”. He dutifully records his thoughts each day. At the end of the month he doesn’t have an epiphany or anything of that sort, nor does he have some breakthrough in his work. When he resumes his life after the month-long sabbatical, he realizes one thing: every hour and every day, we choose to allow devices, services and technologies into our lives. Prioritizing them and being aware of them is half the battle won; consciously stepping away from each of these connections on a daily basis is essential to escape their spell.

Takeaway:

How does it feel to be the only people in history to know life both with and without the Internet? Inevitably, the Internet is rewiring our brains; it does not merely enrich our experience, it is becoming our experience. By thinking about the various aspects that are changing, we might be able to answer two questions: 1) What will we carry forward as a straddle generation? 2) What worthy things might we thoughtlessly leave behind as technology dissolves into the very atmosphere of our lives? Michael Harris has tried answering these questions and in the process has written a very interesting book. I thoroughly enjoyed reading it amidst complete silence. I guess every straddle-generation reader will relate to many of the aspects mentioned in the book.


“Big Data” has entered everyone’s vocabulary, thanks to the wild success of a few companies that have used data to provide valuable information and services. This book gives a bird’s-eye view of the emerging field.

The book starts off with an interesting example of the way Google predicted the spread of flu in real time by analyzing two datasets, the first containing the 50 million most common search terms that Americans type and the second containing data on the spread of seasonal flu from public health agencies. Google did not start with a hypothesis, test a handful of models and pick one among them. Instead, Google tested a mammoth 450 million different mathematical models against the search terms, comparing their predictions with the actual flu cases. Google used the resulting model when the H1N1 crisis struck in 2009, and it gave more meaningful and valuable real-time information than any official public health system.

There is no formal definition of the term “Big Data”. Informally, it means the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. It is estimated that the total amount of stored information in the world is close to 1,200 exabytes (an exabyte is a billion gigabytes), of which only about 2% is analog. So, after the basic digitization efforts, the next game changer, the book predicts, is going to be “Big Data”. At its core, big data is about predictions. Though it is often described in terms of the branch of computer science called “machine learning”, this characterization is misleading: big data is not about trying to “teach” a computer to “think” like humans. Instead, it’s about applying math to huge quantities of data in order to infer probabilities.

The book talks about the three shifts that are happening in the way we analyze information:

  1. “N = all”: Gone are the days when you had constraints on getting the entire dataset. In the “small data” world, one would start with a few hypotheses, use statistics to draw the right sample from the population, use estimation theory to find the right statistic to summarize the data, and then draw conclusions. This procedure is becoming irrelevant, or at least a predominant part of it is. Once you are dealing with the entire dataset, a much richer structure of data is available, and the variations among subcategories can be studied to one’s heart’s content.
  2. Messy: Gone are the days when you had to crave exact datasets. Data nowadays is stored across multiple servers and platforms and is inherently messy. Instead of trying to structure it and make it fit a relational database, database technologies are evolving in which “no structure” is the core philosophy; NoSQL, Hadoop, MapReduce, etc. are testimony to the fact that database technology has undergone a radical change. With messy data comes the advantage of breadth: simple models with lots of data perform better than elaborate models with less data. One of the examples mentioned in this context is the grammar checker in MS Word. Instead of spending effort developing more efficient algorithms than those already available, the folks at MSFT decided to focus on building a large corpus. This shift in thinking dramatically improved the results: simple algorithms performed much better than complicated ones once they had a large corpus of words as ammunition. Google has taken grammar checking to a completely different level by harnessing big data.
  3. Correlations: “What” matters more than “Why”. Causality can be bid goodbye once you have huge datasets. Correlations that are notoriously unstable in small datasets can provide excellent patterns when analyzing big data; nonlinear relationships can also be teased out. Typically, nonlinear relationships have more parameters to estimate, so the data needed to pin down these parameters becomes huge, and in the “small data” world the estimates carry high standard errors. In the “Big Data” world, “N = all” means the estimates tend to be stable (see the small simulation after this list). Does this mean the end of theory? Not really. Big data itself is founded on theory: it employs statistical and mathematical theories, and at times computer science theory too. Yes, these are not theories about the causal dynamics of a particular phenomenon like gravity, but they are theories nonetheless, and models based on them hold very useful predictive power. In fact, big data may offer a fresh look and new insights precisely because it is unencumbered by the conventional thinking and inherent biases implicit in the theories of a specific field.
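A small simulation of my own (not from the book) illustrates the stability point: the spread of the sample correlation across repeated datasets shrinks roughly as one over the square root of N, so patterns that wobble in small samples settle down when N gets large.

```python
import numpy as np

rng = np.random.default_rng(42)

def correlation_spread(n, trials=200, rho=0.3):
    """Standard deviation of the sample correlation over repeated datasets of size n."""
    estimates = []
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)  # true correlation is rho
        estimates.append(np.corrcoef(x, y)[0, 1])
    return np.std(estimates)

for n in (30, 1_000, 100_000):
    print(f"n = {n:>7}: spread of correlation estimates = {correlation_spread(n):.4f}")
```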

The book goes on to describe various players in the Big Data Value chain

  • Data Holders : They may not have done the original collection, but they control access to information and use it themselves or license it to others who extract its value.
  • Data Specialists : companies with the expertise or technologies to carry out complex analysis
  • Companies and individuals with big data mindset : Their strength is that they see opportunities before others do—even if they lack the data or the skills to act upon those opportunities. Indeed, perhaps it is precisely because, as outsiders, they lack these things that their minds are free of imaginary prison bars: they see what is possible rather than being limited by a sense of what is feasible.

Who holds the most value in the big-data value chain? According to the authors,

Today the answer would appear to be those who have the mindset, the innovative ideas. As we saw from the dotcom era, those with a first-mover advantage can really prosper. But this advantage may not hold for very long. As the era of big data moves forward, others will adopt the mindset and the advantage of the early pioneers will diminish, relatively speaking.

Perhaps, then, the crux of the value is really in the skills? After all, a gold mine isn’t worth anything if you can’t extract the gold. Yet the history of computing suggests otherwise. Today expertise in database management, data science, analytics, machine-learning algorithms, and the like are in hot demand. But over time, as big data becomes more a part of everyday life, as the tools get better and easier to use, and as more people acquire the expertise, the value of the skills will also diminish in relative terms. Similarly, computer programming ability became more common between the 1960s and 1980s. Today, offshore outsourcing firms have reduced the value of programming even more; what was once the paragon of technical acumen is now an engine of development for the world’s poor. This isn’t to say that big-data expertise is unimportant. But it isn’t the most crucial source of value, since one can bring it in from the outside.

Today, in big data’s early stages, the ideas and the skills seem to hold the greatest worth. But eventually most value will be in the data itself. This is because we’ll be able to do more with the information, and also because data holders will better appreciate the potential value of the asset they possess. As a result, they’ll probably hold it more tightly than ever, and charge outsiders a high price for access. To continue with the metaphor of the gold mine: the gold itself will matter most.

What skills are needed to work in this Big Data world?

Mathematics and statistics, perhaps with a sprinkle of programming and network science, will be as foundational to the modern workplace as numeracy was a century ago and literacy before that. In the past, to be an excellent biologist one needed to know lots of other biologists. That hasn’t changed entirely. Yet today big-data breadth matters too, not just subject-expertise depth. Solving a puzzling biological problem may be as likely to happen through an association with an astrophysicist or a data-visualization designer.

The book ends with chapters on risks and control, where the authors cover a variety of issues that will have to be dealt with in the “Big Data” world. In trying to explain the field, the book gives a ton of examples. Here are some that I found interesting:

  • Google Flu trends – Fitting half a billion models to cull out 45 variables that detect the spread of flu.
  • The entire dataset of sumo wrestling results analyzed by the Freakonomics authors to cull out interesting patterns
  • Farecast, a site that helps predict the direction of air fares over different routes
  • Hadoop: an open-source alternative to Google’s MapReduce, a system to handle gigantic datasets
  • Recaptcha : Instead of typing in random letters, people type two words from text-scanning projects that a computer’s optical character-recognition program couldn’t understand. One word is meant to confirm what other users have typed and thus is a signal that the person is a human; the other is a new word in need of disambiguation. To ensure accuracy, the system presents the same fuzzy word to an average of five different people to type in correctly before it trusts it’s right. The data had a primary use—to prove the user was human—but it also had a secondary purpose: to decipher unclear words in digitized texts.


  • 23andMe – DNA sequencing using the Big Data mindset
  • Billion Prices Project: a project that scours the web for price information and gives a real-time indication of CPI. This kind of information is crucial for policy makers.
  • ZestFinance – Its technology helps lenders decide whether or not to offer relatively small, short-term loans to people who seem to have poor credit. Yet where traditional credit scoring is based on just a handful of strong signals like previous late payments, ZestFinance analyzes a huge number of “weaker” variables. In 2012 it boasted a loan default rate that was a third less than the industry average. But the only way to make the system work is to embrace messiness.
  • Endgames cracked: chess endgames with 6 or fewer pieces on the board have been cracked. There is no way a human can outsmart a computer here.
  • NY city manholes problem solved using Big Data thinking.
  • Nuance made a blunder while licensing technology to Google for GOOG-411, a local search listings service. Google retained the voice translation records and reused the data in a whole gamut of services.
  • Flyontime: visitors to the site can interactively find out (among many other correlations) how likely it is that inclement weather will delay flights at a particular airport. The website combines flight and weather information from official data sources that are freely available and accessible through the Internet.
  • Decide : Price-prediction engine for zillions of consumer products.
  • Prismatic : Prismatic aggregates and ranks content from across the Web on the basis of text analysis, user preferences, social-network-related popularity, and big-data analytics. Importantly, the system does not make a big distinction between a teenager’s blog post, a corporate website, and an article in the Washington Post: if the content is deemed relevant and popular (by how widely it is viewed and how much it is shared), it appears at the top of the screen.
  • Zynga: “We are an analytics company masquerading as a gaming company. Everything is run by the numbers,” says a Zynga executive.

Takeaway:

Big data is a resource and a tool. It is meant to inform, rather than explain; it points us toward understanding, but it can still lead to misunderstanding, depending on how well or poorly it is wielded. This book gives a good overview of the major drivers that are taking us towards the Big Data world.


This graphic novel talks about Steve Jobs and the Zen Buddhist priest Kobun, who acted as Jobs’ spiritual guru. Hard-core Apple fans might like to know the kind of conversations that Jobs had with Kobun. However, I felt the book was pointless; I think it is merely trying to cash in on two things: 1) the increasing popularity of graphic novels among adults and 2) Steve Jobs’ death in October 2011.


In the introduction, the author tries to define HFT using the voices of leading HFT players. What are the characteristics of HFT? There is no agreement on a common definition of HFT, but there are some elements that one can at least attribute to HFT. They are

  • Low latency trading.
  • High turnover strategy.
  • Trader goes home `flat’ with no open position.

The goal of any HFT firm, broadly, is to have a set of uncorrelated trading strategies that produce statistically more winners than losers across all the trading positions of the day. The introduction mentions the fundamental market-driven factors that gave rise to HFT:

  • RegNMS.
  • Order Handling Rules leading to ECNs and hence a proliferation of venues.
  • Explosion of the universe of instruments; the human market maker giving way to the computerized market maker.

The author ends his introduction by clarifying terms such as program trading, quant trading, algo trading, automated trading, prop trading, stat arb and ultra-high-frequency trading. I found these terms better explained in Barry Johnson’s book, `Algorithmic Trading and DMA'.


The Emergence of High-Frequency Trading

This section lists the significant events from 1969 to 2007 that were precursors to today’s HFT:

  • 1969 Feb – Instinet Launched.
  • 1971 Feb – NASDAQ starts electronic trading in 2500 OTC stocks.
  • 1975 May – SEC bans fixed commission rates.
  • 1976 May – NYSE introduces DOT.
  • 1984 July – NYSE Approves SuperDOT.
  • 1987 – Launch of ITG Posit.
  • 1994 August – BNP purchases Cooper Neff Options trading firm.
  • 1997 – Order Handling rules approved.
  • 1998 – Launch of Attain, NexTrade, Strike Technologies, Bloomberg tradebook.
  • 1999 April – Regulation ATS becomes effective.
  • 1999 July – Goldman buys Hull for $530M.
  • 2000 – BRUT/Strike merger, Archipelago and PSX electronic exchange launched.
  • 2001- Launch of Liquidnet, Launch of Direct plus, Archipelago/REDI Book merger.
  • 2001 April – Decimalization rule hits stock exchanges.
  • 2002 – Launch of NASDAQ Super montage, Instinet – Island merger, Tradebot trades 100 million shares a day for the first time.
  • 2004 – Nasdaq acquires BRUT, Launch of Pipeline.
  • 2005 – RegNMS comes to force, NASDAQ acquires INET, NYSE acquires Archipelago, Knight acquires Attain and renames to DirectEdge ECN, Citi acquires NexTrade.
  • 2006 – NYSE buys Euronext, CME-CBOT merger.
  • 2007 – Chi-X Europe launched by Instinet. It becomes the largest multilateral trading facility in Europe.

A timeline graphic of the above events would have made the chapter easier to read and recall.


The Path to Growth

This chapter traces the developments from 2007 through June 2010:

  • 2007 April – General Atlantic invests $300M in GETCO.
  • 2007 May – NASDAQ announces its plans for acquiring Nordic Stock exchange.
  • 2007 July – Citi buys Automated Trading desk for $680M.
  • 2007 Oct – NASDAQ buys Boston Stock exchange.
  • 2007 Nov – MiFID goes live in Europe , NYSE announces its plans for a data center Mahwah, NJ.
  • 2008 June – `Flow traders’ receives funding.
  • 2008 June – GS Sigma X and Credit Suisse ATS platform log in record volumes of 406M and 210M shares respectively.
  • 2008 Nov – BATS becomes an exchange with its USP as speed with maker-taker model. As of today, BATS logs in 12% of the daily US equity trading volume.
  • 2008 – Citadel’s Tactical Trading fund earns $1.19B in 2009 with a team of 55 people!
  • 2009 Mar – GSET launches Sigma-X in Hong Kong. It soon becomes a big hit.
  • 2009 June – NASDAQ introduces two types of flash orders; BATS introduces flash orders.

High Frequency Trading Goes Mainstream

This chapter traces the developments during 2009-2011, the time frame when HFT entered the parlance of the media, journalists, politicians, economists and others. It all began with a New York Times article on HFT by Charles Duhigg on July 24, 2009. That was the first time HFT became known to a wider audience and, most importantly, it led to the false notion among people that flash orders were used to manipulate the market.

Soon after this article was published, NASDAQ and BATS stopped offering flash orders. Amidst the falling economy, HFT was making money, and this was met with harsh reactions from everyone. Meanwhile, BATS and DirectEdge were eating into the market share of NASDAQ and NYSE. To stay competitive, NASDAQ launched INET in late 2009, an HFT-enabled platform. NASDAQ also asked the SEC to look into the various data centers that were mushrooming in the country; these data centers were heavily used by hedge funds and HFT shops. Meanwhile, Chi-X was launched in Europe and Asia and was very successful in both venues. In Japan it was launched under the name `Arrow Head', and the latest numbers from Arrow Head, 100 million shares a day, show that it is a massive success. There is another market development mentioned here, namely the ban on naked access. Naked access goes under various names, ‘Sponsored Access’, ‘Unfiltered Access’, etc., but they all boil down to one type of access: allowing hedge funds to participate in the market using a broker’s market participant ID. The broker does not put any controls on the orders and merely offers the platform for clients to trade with the broker’s participant ID. According to the Aite Group, 38% of equities volume came from naked access, so banning it was a big jolt to all the broker-dealers.

The chapter also features the success story of GETCO, an options trading firm started by two floor traders in 1999 that went on to become one of the biggest success stories from Chicago. There are some interesting facts in GETCO’s story; for instance, it hired skilled video gamers from the Illinois Institute of Technology for its trading team. Come to think of it, HFT is a kind of video game played with sophisticated technology. In February 2011, GETCO became an NYSE designated market maker alongside biggies like Goldman, the Kellogg Group, Bank of America and Barclays. The success of GETCO is also an indication of what a powerful financial and technology hotbed Chicago had become.


The Technology Race in HFT

Based on a few interviews with HFT shops, the author lists some basic requirements for an HFT shop to get going:

  • A reliable and sufficiently fast data input source.
  • Robust and sufficiently fast market access.
  • Low enough transaction, clearing and processing costs, as well as sufficiently effective and efficient processing and verification capability.
  • Strategy monitoring tools.

Well, if these things appear obvious, Kumiega and Ben Van Vliet of the Illinois Institute of Technology have developed a step-by-step methodology (using machine control theory) that addresses the needs of the institutional trading and hedge fund industries for the development, presentation and evaluation of high-speed trading and investment systems. Their methodology says that the following stages of development should serve as a road map for running an HFT shop.

  • Before firms even start building, they should ask themselves how they are going to monitor a successful algo.
  • Build a customized database of historical data and purchase or build a tool for proper backtesting of a strategy.
  • Data cleaning methods are a key component. Writing your own scrub methods is better, as it gives you control over the data. Procure dirty data and see whether your system works.
  • When algos go out of range, say 3–4 standard deviations, stop the strategy and fix it, just as a plant shuts off a machine that drifts 3 standard deviations from normal (a small sketch of this idea follows the list).
  • Backtesting is critical. However, there are certain aspects you cannot backtest with confidence: latency, market data and timing are always going to be difficult to check in a backtest.
  • Fully define the functionalities and performance requirements of the trading/investment system before developing it. Don’t trade off robustness for the sake of speed.
  • Put in place a reporting mechanism to capture relevant metrics for the HFT strategies.
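
The out-of-range rule above is essentially statistical process control applied to a trading strategy. Here is a minimal Python sketch of my own (the P&L series and the 3-sigma threshold are illustrative assumptions, not anything prescribed by the book):

```python
import statistics

def should_halt(pnl_history, latest_pnl, threshold_sigma=3.0):
    """Return True if the latest per-interval P&L drifts more than
    threshold_sigma standard deviations from its historical mean,
    mimicking a plant shutting off an out-of-control machine."""
    mean = statistics.mean(pnl_history)
    stdev = statistics.stdev(pnl_history)
    if stdev == 0:
        return False
    return abs(latest_pnl - mean) > threshold_sigma * stdev

# Hypothetical usage: per-interval P&L of a running strategy.
history = [120, 95, 130, 110, 105, 90, 125, 115]
print(should_halt(history, latest_pnl=-480))  # True: the strategy is out of range
```

In practice the monitored metric could just as well be fill rates, message latency or position drift; the control-chart logic stays the same.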

The chapter ends by saying that low latency is essential, but it is the strategy that needs to be robust. This is a very different view from what I heard from one of my friends last week, who is of the opinion that there are fewer than 10 strategies in use across all the HFT shops and that speed is the ONLY thing that matters. This book and the people quoted in it say otherwise. Maybe both matter equally!


The Real Story Behind the “Flash Crash”

I found this chapter to be the most interesting in the entire book, as it goes behind the events that led to the NY Times article of July 2009. It was in June 2009 that NASDAQ and BATS introduced flash orders to compete against DirectEdge. DirectEdge had given its participants a special order type called Indication of Interest, which had helped DirectEdge add 10% to its existing market share. So this tactic by NASDAQ was meant to get the SEC to ban such orders across the entire US market, including the CBOE, which had a roaring flash-order business. The SEC did not make regulatory changes as far as options are concerned, but it did make changes, and flash orders were subsequently removed from the market completely.

One typically associates the flash crash with some kind of HFT trading gone bad. But this chapter makes it very clear that HFT traders hate flash orders, as the only entities that profit from a flash order are the exchanges and the party placing the order. In fact, reading this chapter makes one realize that the flash crash actually shows why HFT firms are needed in the first place, because most of the HFT firms shut off their computers during the flash crash. Since there was no liquidity from HFT firms, volatility increased so much that the Dow tanked 600 points and bounced back. Liquidity and volatility have an inverse relationship, and thus ‘HFT firms providing liquidity lessen the volatility’ is the argument provided for the HFT case. When the SEC report investigating the flash crash came out, it was clear that a massive $4 billion market sell order in E-mini contracts was the tipping point for the crash in an already nervous market. Why a trader would put in such a big order as a market order rather than a limit order is surprising. Somehow the press has been flogging HFT for the flash crash, when in fact there were firms providing liquidity even during the crash. Rumor has it that a firm in Chicago made $100M in that one day; given that HFT industry profits are close to $2B–$3B, that is a huge amount of money.

Will there be more flash-crash-type events in the future? Certainly, says the book, and it adds that this has got nothing to do with HFT strategies.


Life after the “Flash Crash”

This section describes the various events following the May 2010 flash crash:

  • May 12, 2010 – Talk of an HFT tax, as HFT firms were falsely blamed for the flash crash.
  • May 14, 2010 – The first piece of the investigation reveals that it was a traditional money manager placing a $4B sell trade without a price limit that triggered the crash.
  • May 24, 2010 – Talk of eliminating stub quotes.
  • May 26, 2010 – Goldman launches DMA in Latin American markets.
  • June 4, 2010 – SGX announces the $250M “Reach” initiative, billed as the fastest platform in the world with a 90-microsecond door-to-door speed (NASDAQ’s door-to-door metric is 177 microseconds). The proposed platform is 100 times faster than SGX’s existing infrastructure.
  • June 10, 2010 – The SEC approves rules that require exchanges to stop trading in specific stocks if the price moves by 10% or more in a 5-minute interval.
  • June 11, 2010 – GETCO hires Elizabeth King from the SEC. This practice of hiring people from regulatory bodies is known as the ‘revolving door’.
  • June 22, 2010 – GETCO launches its dark pool.
  • July 21, 2010 – DirectEdge launches two platforms, one for black-box stat-arb players and a second for passive agency order flow.
  • Aug 24, 2010 – Chi-X receives a takeover enquiry from BATS. Chi-X becomes the second-largest trading venue in Europe after the London Stock Exchange.
  • Sep 1, 2010 – BATS and NASDAQ stop flash orders. DirectEdge continues with Indication of Interest orders, a flash order in disguise.

From the flash crash to the Waddell & Reed trade that started it all, from the discussion about stub quotes to circuit breakers, and from the launch of new exchanges in the United States to GETCO’s dark pool in Europe, by November 2010 high frequency trading was on the minds of everyone in the financial world.

The Future of HFT

This section lists the events that happened in late 2010:

  • Oct 21, 2010 – Chi-X Global announces that it will start operating in Australia by Mar 2011.
  • Oct 21, 2010 – SGX makes GETCO a trading member in its securities operations.
  • Oct 25, 2010 – SGX makes an $8.3B takeover offer for the operator of the Australian bourse.
  • Nov 3, 2010 – The SEC votes to bar brokers from granting HFT firms unfiltered access to an exchange, a move aimed at imposing safeguards meant to prevent bad trades.
  • Nov 8, 2010 – The SEC decides to ban ‘stub quotes’.
  • Nov 11, 2010 – Chi-East, a joint venture between Chi-X Global and SGX, is launched and becomes the first pan-Asian independent dark pool to be backed by a regional exchange. Chi-East is widely regarded as a landmark in the development of Asian equities trading.
  • Nov 24, 2010 – DirectEdge announces that it is changing the way it handles flash orders and indicates that it will slowly phase them out.
  • Dec 1, 2010 – Equiduct Systems Ltd announces that it traded more than 1 billion euros in November 2010 for the first time, as the ATS aimed at retail brokers gathered momentum. Equiduct is an MTF operated by the Berlin bourse.
  • Dec 6, 2010 – ASX, which was selling itself to SGX, says that the decision to merge with SGX is in the best interests of Australia’s market-microstructure development. The combined SGX–ASX entity would list companies worth $1.9T, fourth in Asia behind Tokyo, Hong Kong, and Shanghai.

The book is interspersed with interviews with six high frequency traders.

Meet the Speed Trader : John Netto


In this chapter, John Netto, a high frequency trader, talks about his trading experiences and HFT in general. Here are some of the aspects he mentions:

  • The creation of the HFT strategy was the easy part. Issues such as how one will raise capital and how one will handle compliance, regulation, reporting and infrastructure development are all questions not asked at first, but they become a critical part of the profitability equation.
  • Low latency is very important for HFT, but the degree of importance varies based on the strategy.
  • In terms of infrastructure, I would say buy rather than build, so that you can focus on strategies to make money.
  • Sophisticated systems can be thought of as assistants to the traders, not replacements for them, because after all, strategy, structure and people are key to success.
  • HFT is going to get bigger, stronger and more prevalent. More traditional investment managers will explore HFT in the times to come.
  • Always have a few strategies that you think make money, at least on paper.
  • I work out daily. Fitness and being active are a very important part of success. Trading requires you to be mentally alert, and hence you have to be fit.

Meet the Speed Trader : Aaron Lebovitz


Aaron Lebovitz spent two decades in the industry and then, in 2003, started his own firm, Infinium Capital Management, a prop shop, and built it into an exceptional force in the industry. Here are some of the aspects he mentions:

  • Hook up with a firm that has a proven track record, and make sure you know how to write code.
  • Integrating formal models into trading decisions is the kicker you get from applying quant methods.
  • He started off with pairs trading as his first high-frequency strategy.
  • His mantra is ‘developing new markets, increasing global market liquidity, and driving efficient price discovery’.
  • Infinium trades 23.5 hours a day, 6 days a week. It did shut down its machines on the day of the flash crash.
  • It is not all about speed. Having subtle, good strategies is what gets you alpha.
  • The future of HFT is still very much in doubt, hinging on an uncertain regulatory environment.
  • I don’t foresee traditional investment managers shifting their focus to becoming market makers or developing statistical arbitrage strategies

Meet the Speed Trader : Peter van Kleef


Peter van Kleef has spent the last 15 years running automated trading operations. Here are some of the aspects he mentions:

  • It never really crossed my mind why someone would want to trade any other way than HFT. It is just so much more efficient
  • Start writing your own trading strategies and see whether they make money on paper. Think about fancy infra later.
  • Trying to apply solutions from other fields of science and research is also a start when exploring HFT.
  • Whenever possible, I try to balance work with some sort of physical exercise. It is a MUST for being a better trader.
  • Low latency is important but is not the be-all and end-all.
  • It is time for people to stop competing for speed and start producing alpha.
  • A lot of the profits that are sustainable usually will come from the strategy and not necessarily from technology.
  • Leverage is the key driver because returns are quite low.
  • The overall profit largely depends on the number of trades that can be done, because the average profit per trade is nearly certain. I don’t know many other strategies that, if run well, consistently return about 50% per year and above. We are not in the business to rate ourselves; high-frequency trading is very profitable if done right. That much is sure.
  • Sophisticated retail investors are already running HFT strategies.
  • The HFT of yesteryear is becoming a commodity, as you can buy all the infrastructure that is needed.
  • Part of what will determine the future of the industry is what regulatory changes legislators and the SEC will introduce.

Meet the Speed Trader : Adam Afshar


Adam Afshar’s views on HFT are a little different from those of the traders covered earlier. He is of the opinion that low latency is extremely critical. Here are some of the aspects he mentions:

  • HFT is a method or a tool; it is not a strategy. HFT methods are applied to three types of strategies – market making, stat arb and algorithmic execution.
  • My HFT shop uses models with absolutely no human intervention in the execution. I believe this should be the way, as human order execution does not allow you to collect data and you can’t backtest it. If you use an algo for execution, you can collect all kinds of data and then improve it. Your reliance on specific individuals is also lower.
  • High-speed data management is the linchpin of a successful HFT shop.
  • You can’t avoid the latency issue. It’s a prerequisite for whatever you do in the HFT world.
  • The biggest revenue-generating idea in my shop is one that scours 110 million news items and takes positions intraday.
  • HFT’s popularity will slowly fade away. It will become a commodity once technology makes the landscape flat.
  • My advice to students is to take science and liberal arts subjects, as they teach you how to think. You can figure out what you want to do later.

Meet the Speed Trader : Stuart Theakston


Stuart Theakston runs GLC, an HFT firm. Here are some of the aspects he mentions:

  • High frequency traders simply replace specialists/jobbers in providing liquidity, in a much more competitive framework.
  • HFT traders are not getting a free lunch by installing heavy-duty infrastructure. In fact, all they are doing is eating each other’s lunches.
  • An unfortunate aspect of the way the flash crash was handled by the exchanges: some of the trades were canceled. This gives HFT traders all the more reason to stay away from the market during a crisis.
  • I spend 70% of my time with quants, traders and programmers and 30% of my time watching the market and doing research myself. I read a lot of academic papers and journals, as they provide a valuable source of ideas.
  • Most people in the media, political circles and policy-making roles are clueless about the function of HFT. Reducing volatility demands liquidity, and HFT firms play a crucial role here. Policy makers have the potential to kill the entire HFT crowd if they don’t understand the issues properly.
  • Order anticipation and cross-venue arbitrage are sensitive to latency; liquidity provision is less so.
  • Prop trading firms can spend 99 cents on infrastructure to make 1 dollar. Hedge funds can only spend 19 cents of the dollar, given their structure. So these participants cannot compete in the same space.
  • HFT is nearing the bottom in terms of the competition to be the fastest. The opportunities are in asset classes outside equities, like forex and exchange-traded CDS, and outside the developed markets.
  • Suppliers will spring up who try to sell HFT trading in a box to the masses. But retail users won’t make any money, at least on a properly risk-adjusted basis.
  • Most of the development in HFT has already happened in developed-market equity and equity-derivative markets, and these are approaching maturity. These markets are likely to be the domain of increasingly specialist outfits; the low-hanging fruit has been picked. The next wave is likely to be in emerging markets and esoteric ETFs, and this is where people developing their careers in this space should focus.

Meet the Speed Trader : Manoj Narang


Manoj Narang started his firm in 1999 to provide a financial toolbox. Over the last few years, he has transformed it into a profitable HFT firm. Here are some of the aspects he mentions:

  • It’s the law of large numbers at play in HFT. If you have a strategy that works 55 percent of the time, you can’t do just a few trades. You have to do a ton of trading to translate that edge into sustainable profits (see the simulation sketch after this list).
  • We generally are buyers of hardware and builders of software and systems.
  • We have never had a losing week since we started in the business.
  • A successful high frequency strategy will have a Sharpe ratio higher than 4, and a successful HFT operation that runs multiple strategies generally will have a double-digit Sharpe ratio. Such high Sharpe ratios are unheard of in traditional investment shops.
  • A back-of-the-envelope calculation shows that the profitability of the entire HFT industry is around $2B. Not much in relative terms: there are many other corners of the financial industry, such as derivatives trading, that generate hundreds of times this amount of profit in a year.
  • The main use of outside capital in the world of HFT is to fund research and development and operations, not to actually trade the capital. Very little capital is required for trading purposes.
  • Once you are an electronic market maker, it is skills that become important and not status or power or connections.
  • Market making function in equities is decentralized. It doesn’t mean everyone should be a market maker. Just because everyone needs food, doesn’t mean everyone becomes a grocer. Similarly individuals should seek to be investors and not liquidity providers.
  • Wall street has little to do with HFT.
  • I think the main risk in the financial markets is that capital is getting increasingly concentrated, which makes herd-like behavior even more prevalent. The frequency with which bubbles inflate and explode is getting more and more rapid as a result, because massively capitalized investors jump from one asset class to another, leaving devastation in their wake.
  • HFT does not connote systemic risk. The only systemic risks to the market are the ones posed by herd-like behavior.
  • Tradeworx started HFT in 2009, and this 50-member shop accounts for 3% of the overall volume in the SPDR ETF.
  • One must keep in mind that modeling systemic correlations is VASTLY different from modeling structural correlations.
  • How do I see our fund in five to ten years? I have no idea; five or ten years are an eternity when you are immersed in a world where microseconds matter.
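
Narang’s law-of-large-numbers point (and the double-digit Sharpe ratios he mentions) can be made concrete with a small simulation. This is a minimal sketch of my own: the 55% win rate comes from the quote above, while the trade counts, per-trade payoff and zero risk-free rate are illustrative assumptions, not figures from the book.

```python
import random
import statistics

def simulate(edge=0.55, n_trades=10, n_days=250, pnl_per_win=1.0, seed=7):
    """Simulate daily P&L for a strategy that wins `edge` of its trades,
    with each trade paying +/- pnl_per_win and n_trades trades per day."""
    rng = random.Random(seed)
    daily_pnl = []
    for _ in range(n_days):
        pnl = sum(pnl_per_win if rng.random() < edge else -pnl_per_win
                  for _ in range(n_trades))
        daily_pnl.append(pnl)
    mean, stdev = statistics.mean(daily_pnl), statistics.stdev(daily_pnl)
    sharpe = (mean / stdev) * (252 ** 0.5)  # annualized, zero risk-free rate assumed
    losing_days = sum(1 for p in daily_pnl if p < 0)
    return round(sharpe, 1), losing_days

# With 10 trades a day the 55% edge still produces plenty of losing days;
# with 10,000 trades a day losing days all but vanish, and the Sharpe ratio
# grows roughly with the square root of the trade count.
print(simulate(n_trades=10))
print(simulate(n_trades=10_000))
```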

Takeaway:

If you take all the events that have happened in the HFT world, list them in chronological order, and spice them up with some HFT trader stories and viewpoints, what you get is this book.


I went over this book to get some idea about STP. I was completely clueless about the things that happen at the back end once a trade is placed with a broker. STP was introduced in Indian markets by FT and a few other vendors almost a decade ago, and it has led to a clear improvement in operational efficiency. Even though this book is about the .NET platform, I liked the chapter on STP, which explained things using a business case and .NET code.

STP

  • STP (Straight Through Processing) providers have the responsibility of carrying and delivering STP-related instructions and messages. They provide connectivity between brokers, custodians and fund managers, and bring them onto a common network. Once the entities enter the common network, they communicate with each other using standard messaging protocols like SWIFT 15022.
  • In India, FT is the leader in this game, with its STP gate solution being used by everyone.
  • ISO 15022 is an accepted communication protocol for achieving STP interoperability in several countries.
  • STP was initiated as part of the T+1 initiative in the equities market.
  • A SWIFT 15022 terminal is the means through which all the market participants get onto this platform.


  • Cryptography is critical in this entire STP layer:
    • A substitution cipher can be used.
    • A transposition cipher can be used.
    • Symmetric and asymmetric algorithms are used to protect the confidentiality of the data. I think an asymmetric algorithm is used in FT’s case.
  • Digital signatures address the non-repudiation issues encountered in high-risk situations (a toy sketch of the classical ciphers follows this list).
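
To make the cipher bullets concrete, here is a toy Python sketch of my own showing a substitution cipher and a columnar transposition cipher. These are teaching examples only; a real STP layer would rely on standard symmetric/asymmetric algorithms, not these.

```python
def substitution_encrypt(text, shift=3):
    """Caesar-style substitution: replace each letter with the one `shift`
    places further along the alphabet; non-letters pass through unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def transposition_encrypt(text, key=4):
    """Columnar transposition: write the text row-wise into `key` columns,
    then read it out column by column."""
    columns = ['' for _ in range(key)]
    for i, ch in enumerate(text):
        columns[i % key] += ch
    return ''.join(columns)

msg = "SETTLE TRADE 12345"          # hypothetical settlement instruction
print(substitution_encrypt(msg))    # VHWWOH WUDGH 12345
print(transposition_encrypt(msg))   # same characters, reordered column-wise
```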

STP Interoperability

  • If there are multiple STP platforms, SOA can be used as a model for the different platforms to work together.
  • Web services are used so that STP platforms can talk to each other. SOAP is the messaging framework used by web services to send and receive messages (a minimal request sketch follows).
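
To show what the web-services glue might look like, here is a minimal Python sketch of a SOAP 1.1 request from one STP platform to another. The endpoint URL, namespace and GetSettlementStatus operation are invented for illustration; they are not from the book or any real STP product.

```python
import urllib.request

# Hypothetical SOAP 1.1 envelope asking another STP platform for the
# settlement status of a trade. All names below are made up.
SOAP_BODY = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetSettlementStatus xmlns="http://example.com/stp">
      <TradeId>TRD-12345</TradeId>
    </GetSettlementStatus>
  </soap:Body>
</soap:Envelope>"""

def query_settlement_status(endpoint="http://example.com/stp/service.asmx"):
    """POST the SOAP envelope and return the raw XML response."""
    request = urllib.request.Request(
        endpoint,
        data=SOAP_BODY.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": "http://example.com/stp/GetSettlementStatus",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```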

The entire structure of the book can be summarized using this visual.

[Figure: a visual summarizing the overall structure of the book]

I will go over the contents in greater detail sometime in the future.
