Programming


Learn-git

Git and GitHub have revolutionized the way we create, maintain and share software code. Git is said to be Linus Torvalds' second gift to the world, the first obviously being the Linux operating system. Nowadays it is common for job seekers to showcase their work as a set of GitHub repositories so that employers can evaluate them much better. Open source projects are thriving because of easy-to-use, git-based social coding platforms. The popularity of these platforms has grown to such an extent that many non-programmers use git and GitHub to version control their work. Personally, I know two nonfiction writers/journalists who use git to maintain their various documents.

My guess is that the transition from newbie to moderately skilled GitHub user might take a week or two. It is easy to mug up commands and just use them, but if you want to understand how and why git does certain things, it is worth spending some time on each of its main ideas. There are many tutorials, videos, screencasts and podcasts available on the internet, and they can be a good starting point for a newbie. There are also book-length treatments of Git, GitHub etc. Out of the massive content available for understanding Git, I think this book is one of the best introductions, for the following reasons:

  • Each main idea is condensed into a one-hour lesson, meaning that someone else has already thought through "what's the ideal chunking necessary to understand git?"
  • Each idea is illustrated with visuals, i.e. the ideas that you glean from the book stick with you longer
  • Easy-to-understand examples throughout the book
  • There is a lab at the end of every chapter. One cannot learn unless one practices, more so when one is completely new to the subject
  • Examples are reused all over the book so that there is some reinforcement of the previous ideas

In this post, I will summarize the main points of each chapter.


Chapter 1 – Before you begin

First, the rationale for the title: each chapter is designed to be read during your lunch hour, though not literally. It means that all you need to read any chapter of the book is an hour of your time. In that hour, you should be able to read the text and go through each of the "Try it now" exercises for that chapter. The lab exercises at the end of each chapter help reinforce its points.

The author suggests a learning path that a reader could follow, i.e. one chapter each day for 20 days. There are no prerequisites to reading this book; any novice can pick up the material covered, provided he or she allocates about 20 hours. My guess is that even if you manage to spend 14 hours on this book, it should make you conversant with the workings of git and turn you into a moderately skilled git user.


Chapter 2 – An Overview of Git and Version Control

If you code, then you typically have files that you modify and save. You might save a file multiple times, but a particular set of saves might warrant a comment. A reader who isn't well versed in version control might try to incorporate such comments into the file name itself, by suffixing or prefixing it. This way of managing files doesn't scale well; hence the need for a versioning system.

Every version control system has three concepts

  • Versioning
  • Auditing
  • Branching

The power of Git comes from the following features:

  • Distributed repositories: Each developer has a full repository to commit to, so there is no separate backup problem: every user of the repository has a complete working copy. The basic difference between previous version control systems and git is that git is a DVCS (distributed version control system). This means that you don't need to run a Git server to get its benefits, and you don't even need a network to run Git's commands. Every developer gets a version-controlled copy of the repository.
  • Fast branching: One way to keep development and production code separate is to put them in two folders and switch between them, remembering which folder to work in. Git makes branching extremely fast; internally it manages branches with a set of pointers, so there is no need to copy files around. The speed with which you can create a new branch and begin working creates a new model for doing work: if you want to try a new idea in your code, create your own branch. Because you're working in a local repository, no other developer is disturbed by this new code stream. Your work is safe and isolated.
  • Staging area: There are situations where you want specific code while developing but a sanitized version in production. For example, a username and password could be hard-coded during development but not in production. Git has a staging area, so you can stage the file you are working on but commit a sanitized version of it to the repository.

The author gives a tour of Git via a GUI interface as well as the CLI. Towards the end of the chapter, the author lists the various terms that one comes across while using git:

  • Branch
  • Check out
  • Clone
  • Commit
  • Distributed
  • Repository
  • Staging area
  • Timeline
  • Version Control

 

Chapter 3 – Getting Oriented with Git

The syntax for using any git command is

git [switches] <command> [<args>]

switches are optional arguments, command is the git command and args are the arguments to the git command.

For example, consider the following command:

git -p config --global user.name "RK"

Here, -p is the switch that paginates output if needed, config is the git command, and --global, user.name and "RK" are its three arguments.

This chapter introduces basic command-line functions that are used to create, remove and rename files and directories. The most important task in this chapter is setting the global user.name and user.email values. These two values, plus a bunch of other data, are used by git to create the SHA1 of each commit object.
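
As a quick illustration, here is a minimal configuration session along the lines of the chapter; the name and email values are placeholders:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config user.name    # prints: Your Name
git config user.email   # prints: you@example.com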

The following are the commands mentioned in the chapter:

Command   Description
git config --global user.name "Your Name"   Add your name to the global configuration
git config --global user.email "Your email"   Add your email to the global configuration
git config --list   Display all the git configurations
git config user.name   Display the user.name configuration value
git config user.email   Display the user.email configuration value
git help help   Ask git for help about its help system
git help -a   Print all the available git commands
git --paginate help -a   Paginate the display of all the available git commands
git help -g   Print all the available git guides
git help glossary   Display the git glossary



Chapter 4 – Making and Using a git repository

This chapter introduces the basics of creating and using a git repository. git init creates a repository on your local machine.

There are two things to keep in mind

  • No server was started
  • The repository was entirely local

The init command creates a special folder called .git, and it contains a host of folders for managing the commit objects, trees, references etc. There is a difference between the working directory and the repository. The working directory is the place where you do your work; the repository is a specialized storage area in which you can save versioned files. The repository lives inside the working directory.

For any file you create in the working directory, you need to make git aware of it. This is done via the git add command. Once git add is run on a file, Git knows about your file and tracks changes to it. But since the file is not yet committed, no time information is recorded in git. A good way to imagine adding a file to git is putting the file in a queue called the staging area; it appears in the repository only after you commit it with a relevant comment.
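
To make this concrete, here is a minimal sketch of the chapter's workflow; the directory and file names are only illustrative:

mkdir math && cd math            # a new working directory
git init                         # creates the hidden .git folder, i.e. the repository
echo "notes on git" > readme.txt # a new file, unknown to git so far
git status                       # shows readme.txt as untracked
git add readme.txt               # put the file in the staging area
git commit -m "Add readme"       # the staged snapshot becomes the first commit
git log                          # the commit now appears in the history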

The following are the commands mentioned in the chapter:

Command   Description
git init   Initialize a git repository in the current directory
git status   Display the status of the current directory, as it relates to Git
git add FILE   Start tracking FILE in Git; adds FILE to the staging area
git commit -m MSG   Commit changes to the git repository, with a message in quotes
git commit -a -m MSG   Add the unstaged files and create a new commit object
git log   Display the log history
git log --stat   Display the log with the files that were modified
git ls-files   List the files in the repository

 


Chapter 5 – Using Git with a GUI

This chapter uses a GUI to create/add/commit to the repository. I used Cola Git Gui to explore the various lessons in this chapter. Towards the end of the chapter, the author touches upon Tcl/Tk. Tcl is a dynamic interpreted language invented in 1988 by John Ousterhout; Tk, a toolkit of GUI controls, was added to the language not long after. Both Git Gui and gitk are written in Tcl/Tk.


Chapter 6 – Tracking and Updating files in Git

The author introduces "staging area" in Git via the following analogy:

Pretend that your code is an actor in a theater production. The dressing room is the Git working directory. The actor gets a costume and makeup all prepared for the upcoming scene. The stage manager calls the actor and says that the scene is just about to start. The actor (the code) doesn’t jump in front of the audience right then. Instead, it waits in an area behind the curtain. This is our Git staging area. Here, the actor might have one last look at the costume or makeup. If the actor (the code) looks good, the stage manager opens the curtains, and the code commits itself to its performance.

Whenever you change anything in the working directory, that change has to be reflected in the staging area before it can be committed to the repository. The author shows the step-by-step procedure of adding a file to the staging area, committing the file, checking the log messages, and finding the difference between the staged file and the file in the working directory.
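
A minimal sketch of that workflow, assuming the readme.txt file from the earlier chapters:

echo "one more line" >> readme.txt  # change a tracked file
git diff                            # working directory vs. staging area: shows the new line
git add readme.txt                  # copy the change into the staging area
git diff --staged                   # staging area vs. repository: shows the same new line
git commit -m "Expand readme"       # the staged change is now in the repository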

The following are the commands mentioned in the chapter:

Command   Description
git commit -m "Message"   Commit changes with the log message entered on the command line via the -m switch
git diff   Show any difference between tracked files in the current directory and the staging area
git diff --staged   Show any difference between the files in the staging area and the repository
git commit -a -m "Message"   Perform git add, then perform git commit with the given message
git add --dry-run   Show what git add would do
git add .   Add all new files to the git repository
git log --shortstat --oneline   Show history using one line per commit, listing the files changed per commit


Chapter 7 – Committing parts of changes

The way to delete a file from the repository is to remove the file from the staging area first and then commit to the repository. The bash command rm will only remove the file from the working directory. To remove the file from the staging area, use git rm; this removes the file from the staging area as well as the working directory. The same logic applies to renaming files: use the git mv command to rename files in the staging area as well as the working directory. Given this, the staging area might seem redundant. However, it is extremely useful for committing partial files. You can choose the portions of a file that you want to stage by using git add -p filename. This presents a list of hunks that you can choose to stage or ignore. It took me some time to get used to this functionality. The other aspect covered in the chapter is the commit itself. When to commit? It makes sense to commit to the repository under any of these conditions:

  • Adding or deleting a file
  • Renaming a file
  • Updating a file to a known good working state
  • When you anticipate being away from the work
  • When you introduce some questionable code

However, the author feels that since all commits are local to the user's machine, it is better to commit as frequently as possible.
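
Here is a rough sketch of these commands in action; the file names are illustrative:

git rm notes.txt                    # remove notes.txt from the staging area and the working directory
git mv readme.txt README.md         # rename the file in the staging area and the working directory
echo "draft section" >> README.md   # make a further edit
git add -p README.md                # review the edit hunk by hunk; answer y to stage, n to skip
git commit -m "Rename readme and commit only the reviewed hunks"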

The following are the commands mentioned in the chapter:

Command   Description
git rm file   Remove file from the staging area and the working directory
git mv file1 file2   Rename file1 to file2 in the staging area and the working directory
git add -p   Pick parts of your changes to add to the staging area
git reset file   Reset your staging area, removing any changes you have made with git add
git checkout file   Check out the latest committed version of the file into your working directory


Chapter 8 – The time machine that is Git

Each commit has a unique SHA1 ID associated with it. This ID is generated from, among other things, the author's email, the time of the commit, the files in the staging area and the previous commit's SHA1. The fact that it is based on the previous commit's SHA1 means you can traverse the entire version history starting from the latest commit. No two commit objects will ever share a SHA1 ID. At the beginning of a project, HEAD and master point to the same commit. As you keep committing, master points to the latest commit, and so does HEAD. However, once you check out a particular older version, HEAD moves back in time to that version. One of the easy ways to refer to a SHA1 is with a tag: you can give a particular SHA1 a tag that you can use later for a quick checkout.
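
For instance, a short session using tags and SHA1-based checkouts might look like this (the tag name is arbitrary):

git log --oneline                        # abbreviated SHA1 IDs, newest first
git tag v1.0 -m "first working version"  # label the commit that HEAD points to
git checkout v1.0                        # HEAD moves back to the tagged commit
git checkout master                      # return to the tip of the master branch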

The following are the commands mentioned in the chapter:

Command   Description
git log --parents   Show the history, displaying the parent commit's SHA1 ID for each commit
git log --parents --abbrev-commit   Same as the preceding command, but shorten the SHA1 IDs
git log --oneline   Display the history concisely, using one line per commit
git log --patch   Display the history, showing the file differences between each commit
git log --stat   Display the history, showing a summary of the file changes between each commit
git log --patch-with-stat   Display the history, combining patch and stat output
git log --oneline file_one   Display the history for file_one
git rev-parse   Translate a branch name or tag into a specific SHA1
git checkout your_sha1id   Change your working directory to match the specified sha1id
git tag tag_name -m "message" sha1id   Create a tag named tag_name, pointing to your sha1id
git tag   List all tags
git show tag_name   Show information about the tag named tag_name


Chapter 9 – Taking a fork in the road

Branching is one of the most important concepts in git. Typically you start with a master code base and, as time goes on, you keep creating divergent code bases. Each divergent code base could represent a bug fix, an enhancement, a new feature, etc. Each branch has a reference, the branch name, that points to the latest commit on that branch. There is also a reference named HEAD that points to the currently checked-out commit. If the checked-out code and the latest commit represent the same set of files, then the branch reference and HEAD point to the same commit object. One often forgets that the SHA1 of every commit object includes information about its parent. Once you create branches, you obviously need to know the commands to

  • switch to another branch
  • list all the branches
  • create a DAG showing all the branches
  • show the difference between the code bases of two branches
  • create and check out a branch in a single line of code

The author gradually introduces commands to do all of the above. He concludes the chapter by introducing the git stash and git stash pop commands.
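
A small sketch of the branching and stashing flow described above; the branch names follow the chapter's examples:

git branch                        # list branches; * marks the checked-out one
git checkout -b fixing_readme     # create a branch and switch to it in one step
# ... edit files, then realize you need to switch branches mid-change ...
git stash                         # shelve the work in progress
git checkout master               # switch with a clean working directory
git checkout fixing_readme
git stash pop                     # bring the shelved changes back and drop them from the stash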

The commands mentioned in this chapter are:

Command   Description
git branch   List all branches
git branch dev   Create a new branch named dev
git checkout dev   Change your working directory to the branch named dev
git branch -d master   Delete the branch named master
git log --graph --decorate --pretty=oneline --abbrev-commit   View the history of the repository across all branches
git branch -v   List all branches with SHA1 information
git branch fixing_readme YOUR_SHA1ID   Make a branch named fixing_readme using YOUR_SHA1ID as the starting point
git checkout -b another_fix_branch fixing_readme   Make a branch named another_fix_branch using branch fixing_readme as the starting point
git reflog   Show a record of all the times you changed branches
git stash   Set the current work in progress aside in the stash, so you can perform a git checkout
git stash list   List the works in progress that you have stashed away
git stash pop   Apply the most recently saved stash to the current working directory and remove it from the stash


Chapter 10 – Merging Branches

"Branch often" is the mantra of a git user. In that sense, merging the created branch with the master or any other branch becomes very important. Branching diverges code base and Merging converges code base. Using the pneumonic "traffic merges in to us", the author reinforces the point that git merge command is used to merge other branches in to the branch we are on. A merge results in creating a commit object that has two parent commits. One of the most useful commands to explore the master branch commit structure is

git log --graph --oneline --decorate --all --parents --abbrev-commit

In any merge, there is a possibility of conflicts between the code bases. Conflicts can be resolved by opening the conflicting files, choosing the appropriate hunks, and creating a new merge commit. The author shows the steps to do a git merge via UI tools. The chapter ends with a discussion of the fast-forward merge; this type of merge arises when the target branch is a direct descendant of the branch that it will merge with. Git also has the ability to merge multiple branches at once; the jargon for such a merge is an "octopus merge".
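
As a sketch, a merge with a conflict might play out like this (branch and file names are illustrative):

git checkout master           # the branch that will receive the changes
git merge fixing_readme       # "traffic merges into us": bring fixing_readme into master
git status                    # if both branches touched the same lines, this lists the conflicted files
# edit the conflicted files, keep the hunks you want, then:
git add readme.txt
git commit                    # the merge commit has two parent commits
git merge --abort             # alternatively, abandon the conflicted merge entirely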

The following are the commands mentioned in the chapter:

Command   Description
git diff BRANCH1...BRANCH2   Show the difference between BRANCH1 and BRANCH2 relative to when they first became different
git diff --name-status BRANCH1...BRANCH2   Summarize the difference between BRANCH1 and BRANCH2
git merge BRANCH2   Merge BRANCH2 into the current branch that you're on
git log -1   A shorthand for git log -n 1
git mergetool   Open a tool to help perform a merge between two conflicted branches
git merge --abort   Abandon a merge between two conflicted branches
git merge-base BRANCH1 BRANCH2   Show the base commit between BRANCH1 and BRANCH2


Chapter 11 – Cloning

When you want to share your code, you can either copy your working directory and send it across, or, in the git world, host your repository for others to clone. In the first approach, all your version control is lost: the receiver has no way to track changes that you make to your code after it has been shared. In the second approach, all your version history is intact and anyone can clone your repository to get access to the entire history of commits. The crucial advantage of cloning is that the copy is linked to the original repository, and you can send and receive changes back to the original.

When you clone a repository, the only branch that appears in the clone is the active branch of the original repository, i.e. the branch pointed to by HEAD. When you look at the tracking branches in a repository cloned from another one, you see a naming convention such as remotes/origin/branch_name. For each branch on the remote repository, git creates a reference branch. These remote-tracking branches, like regular branches, point to the last commit on that line of development, and because every commit points to its parent, you have the entire history. If you want to develop code based on any reference branch, you check out the branch in the usual way with git checkout, and git creates a local branch off the remote-tracking branch.

The author introduces the bare directory, i.e. a standalone directory that contains only a git repository and nothing else. An important aspect of a bare directory is that it has no reference to the original repository. Unlike a clone, which has a reference to its originating repository, the bare directory is a completely standalone repository. Because of this, and because it has no working directory, bare directories often serve as the official copy of a repository. The only way to update it is to push to it, and the only way to retrieve its contents is to clone, or pull, from it.
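
A rough sketch of cloning, assuming a local repository named math:

git clone math math.bob         # clone the repository in ./math into ./math.bob
cd math.bob
git branch --all                # local branches plus the remotes/origin/* tracking branches
cd ..
git clone --bare math math.git  # a bare repository: just the repository contents, no working directory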

The following are the commands mentioned in the chapter:

Command   Description
git clone source destination_dir   Clone the Git repository at source into destination_dir
git log --oneline --all   Display all commit log entries from all branches
git log --simplify-by-decoration --decorate --all --oneline   Display the history in a simplified form
git branch --all   Show remote-tracking branches in addition to local branches
git clone --bare source destination_dir   Clone the bare directory of the source repository into destination_dir
git ls-tree HEAD   Display all the files for HEAD


Chapter 12 – Collaborating with Remotes

This chapter talks about creating references to one or more remote repositories. There could be a single remote or multiple remotes, and they could reside anywhere on the network. Once you set up a remote and clone the repository, you are all set to send and receive changes from the remote. The usual name given to the default remote repository is "origin", but you can change it to any word that fits your mental model.
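
For example, adding and inspecting a second remote might look like this (the bob remote mirrors the chapter's example):

git remote                      # after a clone, this prints the default remote: origin
git remote -v                   # the remote names with their fetch and push URLs
git remote add bob ../math.bob  # add a remote named bob pointing to a local repository
git ls-remote bob               # list the references that bob advertises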

The following are the commands mentioned in the chapter:

Command   Description
git checkout -f master   Check out the master branch, throwing away any changes in the current branch
git remote   Display the names of the remote repositories
git remote -v show   Display the names of the remotes along with the corresponding URLs
git remote add bob ../math.bob   Add a remote named bob that points to the local repository ../math.bob
git ls-remote bob   Display the references of the remote repository bob
GIT_TRACE_PACKET=1 git ls-remote REMOTE   Display the underlying network interaction


Chapter 13 – Pushing your changes

git push is a command that affects another repository besides your own. Once you are done with the changes in your local repository, you might want to share your code with a remote repository. In the case where the remote repository has not changed, the code can be easily merged via a fast-forward merge. If you get a conflict while pushing code, you need to fix your local repository by pulling changes from the remote and then pushing your code. If you create a new branch in your local repository and then try to push your code, git will crib; you have to use the --set-upstream switch so that git creates a branch on the remote and then pushes the code to it. The author also explains the way to delete branches on the remote. It is a two-step process: you first delete the branch from the local repository and then use a specific push syntax, after which the branch on the remote is also deleted. The last section of the chapter talks about pushing and deleting tags on the remote.
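
A sketch of the push-related commands, with hypothetical branch and tag names:

git checkout -b new_feature                 # a new local branch
git push --set-upstream origin new_feature  # first push: create the branch on the remote too
git checkout master
git branch -d new_feature                   # step 1: delete the local branch
git push origin :new_feature                # step 2: push the deletion to the remote named origin
git push origin v1.0                        # push a single tag; git push --tags pushes all of them
git push origin :v1.0                       # delete that tag on the remote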

The following are the commands mentioned in the chapter:

Command   Description
git push origin master   Push the master branch to the remote named origin
git push   Push the current branch to the default remote-tracking branch set up by git checkout or git push --set-upstream
git push --set-upstream origin new_branch   Create a remote-tracking branch for new_branch on the remote named origin
git config --get-regexp branch   List all the git configuration settings that have the word branch in the variable name
git branch -d local_branch   Remove the local branch named local_branch
git push origin :remotebranch   Remove the branch named remotebranch from the remote named origin
git tag -a TAG_NAME -m TAG_MESSAGE SHA1   Create a tag on the given SHA1 with the name TAG_NAME and the message TAG_MESSAGE
git push origin TAGNAME   Push the tag named TAGNAME to the remote named origin
git push --tags   Push all tags to the default remote
git push origin :TAGNAME   Delete the tag named TAGNAME on the remote named origin
git tag -d TAGNAME   Remove the tag named TAGNAME from the local repository


Chapter 14 – Keeping in sync

The rationale for syncing is simple: git will not allow you to push your code to the remote until your local repository is in sync with the remote. git pull is a two-part operation: it comprises git fetch and git merge. The first step fetches from the remote repository and makes your repository's remote-tracking branches look like the remote. This overlays all the commits from the remote repository onto your repository. The crucial thing to note is the pointer named FETCH_HEAD, which points to the most recent remote-tracking branch that was fetched. When git merge is done on your working branch, the FETCH_HEAD pointer is used to merge in all the changes of the same branch on the remote.
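
In command form, the two halves of git pull look roughly like this:

git fetch            # bring in new commits from the remote and update the remote-tracking branches
git merge FETCH_HEAD # merge what was just fetched into the current branch
git pull --ff-only   # or do both steps in one go, refusing anything that is not a fast-forward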

The following are the commands mentioned in the chapter:

Command   Description
git pull   Sync your repository with the repository that you cloned from; this comprises git fetch and git merge
git fetch   The first part of git pull; this brings in new commits from the remote repository and updates the remote-tracking branches
git merge FETCH_HEAD   Merge the new commits from FETCH_HEAD into the current branch
git pull --ff-only   The --ff-only switch will only allow a merge if FETCH_HEAD is a descendant of the current branch


Chapter 15 – Software archaeology

This chapter gives an elaborate explanation of the various switches that go with the git log command. Detailed explanations are also given for the gitk view configurations.
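
A few representative archaeology commands from the chapter, with illustrative arguments:

git log --grep="readme"                        # commits whose message mentions readme
git log --since 01/01/2015 --until 06/30/2015  # commits between two dates
git shortlog -e                                # commits summarized per author, with email addresses
git blame readme.txt                           # who last changed each line of the file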

The following are the commands mentioned in the chapter:

Command   Description
git log --merges   List commits that are the result of merges
git log --oneline FILE   List commits that affect FILE
git log --grep=STRING   List commits that have STRING in the commit message
git log --since MM/DD/YYYY --until MM/DD/YYYY   List commits between two dates
git shortlog   Summarize commits by author
git shortlog -e   Summarize commits by author, including email addresses
git log --author=AUTHOR   List commits by AUTHOR
git log --stat HEAD^..HEAD   List the difference between the currently checked-out branch and its immediate parent
git branch --column   List the branches in columns
git name-rev SHA1   Given a SHA1, show the name of the branch it belongs to
git grep STRING   Find all the files that contain STRING
git blame FILE   Display blame output for FILE


Chapter 16 – Understanding git rebase

It is often the case that the checked-out branch that you are working on in your local repository goes out of sync with the remote master because a collaborator has pushed commits to it. If you then try to push your branch to the remote, git will crib. One way to deal with this situation is to use git rebase. This command alters the history of your local repository by downloading the remote commits and then replaying your changes as descendants of the HEAD of the downloaded commits. The most important reason for using git rebase is to change the starting point of your local branches. In case of an accidental rebase, one can always use git reflog and reset HEAD to point at the relevant SHA1 ID. The chapter concludes by introducing git cherry-pick, which can copy a specific commit to the current branch.
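
A condensed sketch of the rebase safety net described above; new_feature and HEAD@{4} are illustrative:

git checkout new_feature
git log --oneline master..new_feature  # your commits that master does not have yet
git rebase master                      # replay those commits on top of master's latest commit
git reflog                             # if the rebase was a mistake, the old branch tip is still here
git reset --hard HEAD@{4}              # move the branch back to that reflog entry
git cherry-pick SHA1                   # copy one specific commit onto the current branch (SHA1 is a placeholder)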

The following are the commands mentioned in the chapter:

Command   Description
git log --oneline master..new_feature   Show the commits that are on new_feature but not on master
git rebase master   Rebase your current branch on top of the latest commit from master
git reflog   Display the reflog
git reset --hard HEAD@{4}   Reset HEAD to point to the SHA1 ID represented by HEAD@{4}
git cherry-pick SHA1_ID   Copy the commit identified by SHA1_ID onto the current branch


Chapter 17 – Workflows and branching conventions

This chapter discusses the unwritten rules, policies and conventions relating to git.

  • Try to keep the Git commit subject under 50 characters
  • It might make sense to limit users who are given rights to push the code
  • Standardize the name of branches
  • Depending on whether there is a need to preserve the history of every commit, one may or may not want to use git rebase
  • Standardize the name of the tags that can be used

The author explains two workflows that are popular amongst git users

  • git-flow: There are two main branches in a git-flow repository; other branches, such as feature and release branches, are created temporarily and then deleted when finished. The master branch contains released, production-level code; this is what the public can see, perhaps on a deployed website or in some released software that they've downloaded from you. The develop branch contains code that is about to be released (see the sketch after this list).
  • GitHub flow: There is one master branch that lives forever, and feature branches are brought into existence whenever required. Once a feature is developed, it is merged into the master branch. Unlike the git-flow workflow, the branches are not deleted in this workflow.
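
A minimal git-flow-style sketch; the branch names here are hypothetical and only follow the naming convention:

git checkout -b develop master          # a long-lived develop branch next to master
git checkout -b feature/login develop   # a temporary feature branch, named per convention
# ... work and commit on the feature branch ...
git checkout develop
git merge --no-ff feature/login         # keep an explicit merge commit even when a fast-forward is possible
git branch -d feature/login             # git-flow deletes finished feature branches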

The following are the commands mentioned in the chapter:

Command   Description
git commit --allow-empty -m "Initial commit"   Create a commit without adding any files
git merge --no-ff BRANCH   Merge BRANCH into the current branch, creating a merge commit even if it would be a fast-forward merge
git flow   A git command that becomes available after installing gitflow


Chapter 18 – Working with GitHub

GitHub is a service that hosts git repositories. These repositories are typically bare directories and contain all the version-control-related files and folders. The way to create a GitHub repository is via the UI on the GitHub website. Once the bare repository is created, it is ready to be used: one can add its URL using the git remote add command, and the rest is the same as communicating with any remote repository. All the commands such as push, fetch, merge and pull remain the same. The power of GitHub lies in widespread collaboration on a single project. If you want to contribute to a GitHub repository XYZ, the first thing you need to do is fork it. A fork creates a replica of XYZ that serves as your own private space to play with the entire repository: you can clone it onto your local machine, hack on it, develop the code, etc. There is one key element that needs to be kept in mind: all the changes that you push to GitHub will only be present in your fork. They will not be reflected in the original XYZ repository unless you send a pull request to the XYZ owner and the owner accepts it. GitHub has a nice UI that enables any developer to send pull requests to an owner. It also has many features that enable an owner to keep track of various pull requests, maintain a wiki and much more.
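
As a sketch, contributing to a hypothetical XYZ repository might look like this (the user and repository names are placeholders):

# fork XYZ on the GitHub website first, then:
git clone https://github.com/YOUR_USER/XYZ.git
cd XYZ
git remote -v                          # origin points at your fork, not at the original XYZ
git checkout -b my_fix
# ... commit your changes ...
git push --set-upstream origin my_fix  # the commits land in your fork only
# finally, open a pull request on the GitHub website so the XYZ owner can merge my_fix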

The following are the commands mentioned in the chapter:

Command   Description
git remote add github https:///   Add a remote named github that points to your math repository on GitHub
git push -u github master   Push your master branch to the remote identified by github
git clone http:///   Clone your GitHub repository named math


Chapter 19 – Third Party Tools and Git

I did not go through this chapter as I do not foresee, at least in the near future, using the IDE plugins mentioned in it, i.e. Atlassian's SourceTree and Eclipse IDE integration.


Chapter 20 – Sharpening your Git

This chapter urges the reader to explore the configuration files. There are three levels at which config options can be set: first at the local, or repository, level; second at the global level; and third at the system level. The switches used to access the three levels are --local, --global and --system. Each configuration setting is specified as a name=value pair. The author explains ways to configure various editors with git, such as Notepad++ and nano. The author concludes the chapter by giving some general directions for continuing to learn git.
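
A short sketch of the three configuration levels and a one-off setting:

git config --local --list             # settings stored in .git/config of this repository
git config --global --list            # settings stored in your home directory's .gitconfig
git config --system --list            # settings shared by every user of the machine
git config --global core.editor nano  # use nano whenever git needs an editor
git -c log.date=relative log -n 2     # a one-off setting that applies to this command only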

The following are the commands mentioned in the chapter:

Command   Description
git config --local --list   List the local git configuration
git config --global --list   List the global git configuration
git config --system --list   List the system git configuration
git -c log.date=relative log -n 2   Show the last two commits using the relative date format
git config --local log.date relative   Save the relative date format in the local Git configuration
git config --local --edit   Edit the local Git configuration
git config --global --edit   Edit the global Git configuration
git config --system --edit   Edit the system Git configuration
git -c core.editor=echo config --local --edit   Print the name of the local git configuration file
git -c core.editor=nano config --local --edit   Edit the local git configuration file using nano
git config core.excludesfile   Print the value of the core.excludesfile git configuration setting

 

Takeaway

This is an excellent book for learning git when you are short on time. Each chapter takes an hour, and depending on one's requirements, one could select the relevant chapters of the book, read them, practice the lab exercises and become a moderately skilled git user. Highly recommended for a git newbie.


This book gives a macro picture of machine learning. In this post, I will briefly summarize its main points. One can think of this post as a meta-summary, as the book itself is a summary of all the main areas of machine learning.

 

Prologue

Machine learning is all around us, embedded in technologies and devices that we use in our daily lives. It is so integrated with our lives that we often do not even pause to appreciate its power. Well, whether we appreciate it or not, there are companies harnessing the power of ML and profiting from it. So the question arises: do we need to care about ML at all? When a technology becomes so pervasive, you need to understand it a bit. You can't control what you don't understand. Hence, at least from that perspective, having a general overview of the technologies involved matters.

Machine learning is something new under the sun: a technology that builds itself. The artifact in ML is referred to as a learning algorithm. Humans have always been designing artifacts, whether hand built or mass produced. But learning algorithms are artifacts that design other artifacts. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer's needs. But instead of turning stone into masonry or gold into jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.

At its core, Machine learning is about prediction: predicting what we want, the results of our actions, how to achieve our goals, how the world will change.

The author says that he has two goals in writing this book:

  • Provide a conceptual model of the field, i.e. rough knowledge, so that you can use it effectively. There are many learning algorithms out there, and many more are invented every year. The book provides an overview of learning algos by categorizing the people who use them. The author calls each category a tribe; each tribe has its own master algorithm for prediction
    • Symbolists: They view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic.
      • Master algorithm for this tribe is inverse deduction
    • Connectionists: They reverse engineer the brain and are inspired by neuroscience and physics.
      • Master algorithm for this tribe is backpropagation
    • Evolutionaries: They simulate evolution on the computer and draw on genetics and evolutionary biology.
      • Master algorithm for this tribe is genetic programming
    • Bayesians: They believe that learning is a form of probabilistic inference and have their roots in statistics.
      • Master algorithm for this tribe is Bayesian inference
    • Analogizers: They learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
      • Master algorithm for this tribe is Support vector machines.

In practice, each of the master algorithms that a particular tribe uses is good for some types of problems but not for others. What we really want is a single algorithm combining the key features of all of them: the Master Algorithm.

  • Enable the reader to invent the master algorithm. A layman, approaching the forest from a distance, is in some ways better placed than the specialist, already deeply immersed in the study of particular trees. The author suggests the reader pay attention to each tribe and get a general overview of what each tribe does and what tools it uses. By viewing each tribe as a piece of a puzzle, it becomes easy to get conceptual clarity about the entire field as a whole.

The Machine Learning Revolution

An algorithm is a sequence of instructions telling a computer what to do. The simplest algorithm is: flip a switch. The second simplest algorithm is: combine two bits. The idea connecting transistors and reasoning was understood by Shannon, and his master's thesis led to the most important scientific discipline: information theory. Computers are all about logic; flipping sets of transistors is all that any algorithm does. However, behind this benign activity, some of the most powerful algorithms go about doing their work by using preexisting algorithms as building blocks. So, if computing power increases, do all the algorithms automatically become efficient and all-powerful? Not necessarily. The serpent in this garden goes by the name "complexity monster", and it has many heads:

  • Space complexity : the number of bits of info that an algo needs to store in the computer’s memory
  • Time complexity : how long the algo takes to run
  • Human complexity : when algorithms become too complex, humans cannot control them, and any errors in their execution cause panic

Every algorithm has an input and an output: data goes into the computer, the algorithm does its job and out comes the result. Machine learning turns this around: in go the data and the desired result, and out comes the algorithm that turns one into the other. Learning algorithms, or learners, are algorithms that make other algorithms. With machine learning, computers write their own programs. The author uses a nice low-tech analogy to explain the power of machine learning:

Humans have known a way to get some of the things they need by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. The promise of machine learning is that the technology can mirror farming. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

This analogy makes two things immediately clear. First, the more data we have, the more we can learn. Second, ML is a sword that can slay the complexity monster. Any learning algorithm broadly does one of two things: it either learns knowledge or learns skills, i.e. it learns knowledge in the form of statistical models or learns the procedures that underlie a skill. The author offers another analogy for the information ecosystem: databases, crawlers, indexers and so on are the herbivores, patiently munging on endless fields of data; statistical algorithms and online analytical processing are the predators; and learning algorithms are the super-predators. Predators turn data into information, and super-predators turn information into knowledge.

Should one spend years in the computer science discipline to become an ML expert? Not necessarily. CS makes you think deterministically, and ML needs probabilistic thinking. The difference in thinking is a large part of why Microsoft has had a lot more trouble catching up with Google than it did with Netscape. A browser is just a standard piece of software, but a search engine requires a different mind-set.

If you look at all the fantastic stuff behind e-commerce applications, it is all about matchmaking: producers of information are being connected to consumers of information. In this context, learning algorithms are the matchmakers. The arsenal of learning algorithms will serve as a key differentiator between companies. Data is indeed the new oil.

The chapter ends with a discussion of how political campaigning is being revolutionized by ML. With truckloads of data at the individual-voter level, politicians are turning to ML experts to help them with campaign analysis and directed targeting. The author predicts that in the future machine learning will cause more elections to be close; what he means is that learning algorithms will be the ultimate retail politicians.

 

The Master Algorithm

If you look at algorithms such as nearest neighbor, decision trees, naive Bayes etc., they are all domain agnostic. This warrants a question: "Can there be one learner that does everything, across all domains?" If you estimate a parameter in, say, a frequentist way and subsequently check it against Bayesian inference, both give almost the same answer when there is a lot of data. Under a deluge of data, you will almost certainly see the estimates converge, irrespective of the model used. However, the reality is that data is scarce, and that scarcity necessitates making assumptions. Depending on the type of problem and the type of assumption, certain classes of learning models are better than the rest. If one speculates about the presence of a master algorithm, then the assumptions also need to go in as input. This chapter explores the presence of a master algorithm that can learn from the input data, the output data and the assumptions. The author grandly states the central hypothesis of the book as

All knowledge-past, present, and future-can be derived from data by a single, universal learning algorithm.

The author speculates the presence of a master algorithm and says that there are arguments from many fields that speculate the presence of one.

The argument from Neuroscience: The evidence from the brain suggests that it uses the same learning algorithm throughout, with the areas dedicated to the different senses distinguished only by the different inputs they are connected to. In turn, the associative areas acquire their function by being connected to multiple sensory regions, and the "executive" areas acquire theirs by connecting the associative areas and motor output.

If you look at any part of the cortex in a human brain, you find that the wiring pattern is similar. The cortex is organized into six layers, with feedback loops, short-range inhibitory connections and long-range excitatory connections. There is some variation in the patterns, but the structure is pretty much the same across the brain; the analogy is that it is the same algorithm with different parameters and settings. Low-level motor skills are controlled by the cerebellum, which has a clearly different and regular architecture. However, experiments have shown that the cerebellum can be perfectly replaced by the cortex. So this again suggests that there is some "master algorithm" controlling the entire brain.

If you look at a set of learning algos in the ML field, you can infer that at some level they are trying to reverse engineer the brain’s function. One of the five ML tribes, Connectionists, believe in this way of modeling the world.

The argument from Evolution: If one looks at the evolution of species since the beginning of life on earth, one can think of natural selection, or whatever nature does, as an algorithm. This algorithm has all the species as inputs and the species at any given point in time as the output. The master algorithm does the work of eliminating certain species, allowing certain species to mutate, etc. It is a dynamic process where the outputs are again fed back as inputs. This line of thought makes one speculate about the presence of a master algorithm. In fact, one of the five tribes in the ML world, the Evolutionaries, strongly believe in this way of modeling.

The argument from Physics: Most of physics is driven by simple equations that prune away all the noise in the data and focus on the underlying beauty. Physics laws discovered in one domain are seamlessly applied to other domains. If everything we see in nature can be explained by a few simple laws, then it makes sense that a single algorithm can induce all that can be induced. All the Master Algorithm has to do is provide a shortcut to the laws' consequences, replacing impossibly long mathematical derivations with much shorter ones based on actual observations. Another way to look at scientific disciplines is to think of various laws and states as outcomes of a dynamic optimization problem. However, physics is unique in its simplicity; outside physics, mathematics is only reasonably effective.

Machine learning is what you get when the unreasonable effectiveness of mathematics meets the unreasonable effectiveness of data.

The argument from Statistics: Bayesians look at the world with a probabilistic and learning mindset. Bayes' rule is a recipe for turning data into knowledge. In the yesteryears, Bayes' applications were confined to simple problems; with the rise of computing power, Bayesian methods are now used in a wide range of complex modeling situations. Is Bayes the master algorithm? Well, there seem to be many critics of the Bayesian approach to modeling. All said and done, it definitely appears that Bayesian inference will be a part of the "Master Algorithm" in some way.

The argument from computer science: The author mentions the famous unsolved problem in computer science, P vs. NP. NP-complete problems are a set of problems that are equivalent in their computational hardness: if you solve one of them, you solve the rest. For decades, mathematicians, researchers and scientists have been finding clever tricks to almost solve NP-complete problems. But the fundamental problem still eludes us: is the class of problems that we can efficiently solve the same as the class of problems for which we can efficiently check whether a solution exists? If you read up on this problem, you will realize that all NP-complete problems can be reduced to a satisfiability problem. If we invent a learner that can learn to solve satisfiability, it has a good claim to being the Master Algorithm.

NP-completeness aside, the sheer fact that a computer can do a gazillion tasks should make one confident about speculating on the presence of a master algorithm that works across several problems. The author uses the example of the Turing machine and notes that, back then, it was unthinkable to actually see a Turing machine in action. A Turing machine can solve every conceivable problem that can be solved by logical deduction. The fact that we see these machines everywhere means that, despite the odds, we might see a Master Algorithm sometime in the future.

The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.

Interesting analogy: The author is of the opinion that "human intuition" can't replace data. There have been many instances where human intuition has gone terribly wrong and someone with lots of data has done better. The author uses Brahe's, Kepler's and Newton's work to draw a parallel to machine learning.

Science goes through three phases, which we can call the Brahe, Kepler, and Newton phases. In the Brahe phase, we gather lots of data, like Tycho Brahe patiently recording the positions of the planets night after night, year after year. In the Kepler phase, we fit empirical laws to the data, like Kepler did to the planets’ motions. In the Newton phase, we discover the deeper truths. Most science consists of Brahe-and-Kepler-like work; Newton moments are rare. Today, big data does the work of billions of Brahes, and machine learning the work of millions of Keplers. If-let’s hope so-there are more Newton moments to be had, they are as likely to come from tomorrow’s learning algorithms as from tomorrow’s even more overwhelmed scientists, or at least from a combination of the two.

Critics of the Master Algo: Well, for a concept as ambitious as the Master Algo, there are bound to be critics, and there are many. The author mentions a few of them as examples:

  • Knowledge engineers
  • Marvin Minsky
  • Noam Chomsky
  • Jerry Fodor

Hedgehog or Fox: Another question that comes up when we think of the "Master Algorithm" is whether it is a fox or a hedgehog. Many studies have shown that being a fox is far better than being a hedgehog; the hedgehog is synonymous with an "expert". In the context of this book, though, a learning algorithm can be considered a "hedgehog" if variations of it can solve all learning problems. The author hopes that the "Master Algorithm" turns out to be a hedgehog.

Five tribes of ML: In the quest for the master algorithm, we do not have to start from scratch. There are already many decades of ML research, and each research community is akin to a tribe. The author describes five tribes: symbolists, connectionists, Bayesians, evolutionaries and analogizers. Each of the tribes uses its own master algorithm. Here is an illustration from the author's presentation:

[Figures from the author's presentation: fivetribes, master-algo]

But the real master algo that the author is hinting at is an algo that combines all the features of the tribes.

The most important feature of each tribe is that its members firmly believe that theirs is the only way to model and predict. Unfortunately, this thinking hinders their ability to model a broad set of problems. For example, a Bayesian would find it extremely difficult to leave the probabilistic inference method and look at the problem from an evolutionary point of view; his thinking is forged in priors, posteriors and likelihood functions. If a Bayesian were to look at an evolutionary algorithm like a genetic algorithm, he might not critically analyze it and adapt it to the problem at hand. This limitation is prevalent across all tribes. Analogizers love support vector machines, but they are limited because they look for similarities of inputs across various dimensions, i.e. they are bound to be hit by the curse of dimensionality. The same serpent, "the curse of dimensionality", that the author talks about in the previous chapters comes and bites each tribe, depending on the type of problem being solved.

The obvious question that arises in a reader's mind is: can a combination of tribes come together to solve a specific set of problems? Indeed, the tribe categorization is not a hard categorization of the algorithms. It is just meant as a starting point so that you can place the gamut of algorithms in separate buckets.

 

Hume’s problem of Induction

The chapter starts with a discussion of "Rationalism vs. Empiricism". The rationalist likes to plan everything in advance before making the first move; the empiricist prefers to try things and see how they turn out. There are philosophers who strongly believe in one and not the other. From a practical standpoint, there have been productive contributions to our world from both camps. David Hume is considered to be one of the greatest empiricists of all time. In the context of machine learning, one of his questions has hung like a sword of Damocles over all of knowledge:

How can we ever be justified in generalizing from what we’ve seen to what we haven’t?

The author uses a simple example where you have to decide whether or not to ask someone out for a date. The dataset used in the example illustrates Hume's problem of induction, i.e. there is no reason to pick one generalization over another. So a safe way out of the problem, at least to begin with, is to assume that the future will be like the past. Is this enough? Not really. In the ML context, the real problem is: how to generalize to cases that we haven't seen before. One might think that by amassing huge datasets you can solve this problem, but once you do the math you realize that you will run out of data covering all the cases needed to carry the inductive argument safely. Each new data point is most likely unique, and you have no choice but to generalize. According to Hume, there is no way to do it.

If this all sounds a bit abstract, suppose you’re a major e-mail provider, and you need to label each incoming e-mail as spam or not spam. You may have a database of a trillion past e-mails, each already labeled as spam or not, but that won’t save you, since the chances that every new e-mail will be an exact copy of a previous one are just about zero. You have no choice but to try to figure out at a more general level what distinguishes spam from non-spam. And, according to Hume, there’s no way to do that.

The "no free lunch" theorem : If you have been reading some general articles in the media on ML and big data, it is likely that you would have come across a view on the following lines:

With enough data, ML can churn out the best learning algorithm. You don't have to have strong priors; the fact that you have large data is going to give you all the power to understand and model the world.

The author introduces David Wolpert's "no free lunch" theorem, which puts a limit on how good a learning algorithm can be. The theorem says that, averaged over all possible worlds, no learner can be better than random guessing. Are you surprised by this theorem? Here is how one can reconcile oneself to it:

Pick your favorite learner. For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount. All I have to do is flip the labels of all unseen instances. Since the labels of the observed ones agree, there’s no way your learner can distinguish between the world and the antiworld. On average over the two, it’s as good as random guessing. And therefore, on average over all possible worlds, pairing each world with its antiworld, your learner is equivalent to flipping coins.

How does one escape this random-guessing limit? Just care about the world we live in and don't care about alternate worlds. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. What are the implications of the "no free lunch" theorem for our modeling world?

There’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.

Unwritten rule of Machine learning: The author states that the principle laid out by Newton in his work, the Principia, serves as the first unwritten rule of ML:

Whatever is true of everything we’ve seen is true of everything in the universe.

Newton's principle is only the first step, however. We still need to figure out what is true of everything we've seen, i.e. how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner's job is to flesh it out. One way to think about creating a form is via "conjunctive concepts", i.e. a series of statements joined with AND. The problem with conjunctive concepts is that they are practically useless. The real world is driven by "disjunctive concepts", i.e. concepts defined by a set of rules. One of the pioneers in this approach of discovering rules was Ryszard Michalski, a Polish computer scientist. After immigrating to the United States in 1970, he went on to found the symbolist school of machine learning, along with Tom Mitchell and Jaime Carbonell.

Overfitting and Underfitting: The author uses the words "blindness" and "hallucination" to describe underfitting and overfitting models. By using a ton of hypotheses, you can almost certainly overfit the data; on the other hand, by being too sparse in your hypothesis set, you can fail to see the true patterns in the data. This classic problem is mitigated by doing out-of-sample testing. Is that good enough? Well, it is the best that is available without going into muddy philosophical debates or alternative pessimistic approaches like that of Leslie Valiant (author of Probably Approximately Correct).

Induction as inverse of deduction : Symbolists work via the induction route and formulate an elaborate set of rules. Since this route is computationally intensive for large datasets, the symbolists prefer something like decision trees. Decision trees can be viewed as an answer to the question of what to do when rules of more than one concept match an instance. How do we then decide which concept the instance belongs to?

Decision trees are used in many different fields. In machine learning, they grew out of work in psychology. Earl Hunt and colleagues used them in the 1960s to model how humans acquire new concepts, and one of Hunt’s graduate students, J. Ross Quinlan, later tried using them for
chess. His original goal was to predict the outcome of king-rook versus king-knight endgames from the board positions. From those humble beginnings, decision trees have grown to be, according to surveys, the most widely used machine-learning algorithm. It’s not hard to see why: they’re easy to understand, fast to learn, and usually quite accurate without too much tweaking. Quinlan is the most prominent researcher in the symbolist school. An unflappable, down-to-earth Australian, he made decision trees the gold standard in classification by dint of relentlessly improving them year after year, and writing beautifully clear papers about them. Whatever you want to predict, there’s a good chance someone has used a decision tree for it.
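
Just to make the idea concrete (this is my own illustration, not from the book), fitting a decision tree in R takes only a few lines with the rpart package:

    # A minimal decision-tree sketch using the rpart package, classifying the built-in iris data.
    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class")
    print(fit)                       # the learned if-then rules, one per node
    pred <- predict(fit, iris, type = "class")
    mean(pred == iris$Species)       # training accuracy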

The Symbolists : The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate.

Symbolist machine learning is an offshoot of the knowledge engineering school of AI. The use of computers to automatically learn the rules made the work of pioneers like Ryszard Michalski, Tom Mitchell, and Ross Quinlan extremely popular, and since then the field has exploded.

What are the shortcomings of inverse deduction?

  • The number of possible inductions is vast, and unless we stay close to our initial knowledge, it’s easy to get lost in space
  • Inverse deduction is easily confused by noise
  • Real concepts can seldom be concisely defined by a set of rules. They’re not black and white: there’s a large gray area between, say, spam and nonspam. They require weighing and accumulating weak evidence until a clear picture emerges. Diagnosing an illness involves giving more weight to some symptoms than others, and being OK with incomplete evidence. No one has ever succeeded in learning a set of rules that will recognize a cat by looking at the pixels in an image, and probably no one ever will.

An interesting example of a symbolist success is Eve, the robot scientist that discovered a potential malaria drug. There was a flurry of excitement a year ago, when an article titled Robot Scientist Discovers Potential Malaria Drug was published in Scientific American. This is the kind of learning that Symbolists are gung-ho about.

 

How does your brain learn ?

This chapter covers the second of the five tribes mentioned in the book, the "Connectionists". Connectionists are highly critical of the way Symbolists work, as they think that describing something via a set of rules is just the tip of the iceberg. There is a lot more going on under the surface that formal reasoning can't see. Say you come across the word "love": Symbolists would associate a rule with such a concept, whereas Connectionists would associate various parts of the brain with it. In a sense, there is no one-to-one correspondence between a concept and a symbol. Instead the correspondence is many to many. Each concept is represented by many neurons, and each neuron participates in representing many different concepts. Hebb's rule is the cornerstone of the connectionists. In a non-math way, it says that "neurons that fire together wire together". The other big difference between Symbolists and Connectionists is that the former tribe believes in sequential processing whereas the latter believes in parallel processing.

To get a basic understanding of the key algos used by connectionists, it helps to know a bit about the way a neuron is structured in our brain. Here is a visual that I picked up from the author's presentation:

[Image: schematic of a neuron]

The branches of the neuron connect to others via synapses and basic learning takes place via synaptic connections. The first formal model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943. It looked a lot like the logic gates computers are made of. The problem with this model was that the model did not learn. It was Frank Rosenblatt who came up with the first model of learning by giving variable weights to the connections between neurons. The following is a good schematic diagram of the perceptron:

[Image: schematic of the perceptron]

This model generated a lot of excitement and ML received a lot of funding for various research projects. However, the excitement was short lived. Marvin Minsky and Seymour Papert published many examples that the perceptron failed to learn, the simplest and most damaging being the XOR operator. Their book, Perceptrons, was mathematically unimpeachable, searing in its clarity, and disastrous in its effects. Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.
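
To see both sides of this story, here is a minimal perceptron sketch of my own (not the author's code): the classic learning rule w <- w + eta * (y - yhat) * x learns OR easily but can never reproduce XOR.

    perceptron <- function(X, y, eta = 0.1, epochs = 100) {
      w <- rep(0, ncol(X) + 1)                    # weights plus a bias term
      for (e in seq_len(epochs)) {
        for (i in seq_len(nrow(X))) {
          xi   <- c(1, X[i, ])                    # prepend 1 for the bias
          yhat <- as.numeric(sum(w * xi) > 0)     # step activation
          w    <- w + eta * (y[i] - yhat) * xi    # update only on mistakes
        }
      }
      w
    }

    X     <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
    y_or  <- c(0, 1, 1, 1)
    y_xor <- c(0, 1, 1, 0)

    w <- perceptron(X, y_or)
    as.numeric(cbind(1, X) %*% w > 0)   # reproduces OR: 0 1 1 1
    w <- perceptron(X, y_xor)
    as.numeric(cbind(1, X) %*% w > 0)   # never matches XOR, no matter how long you train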

Fast forward to John Hopfield's work on spin glasses: it triggered a reincarnation of neural networks.

Hopfield noticed an interesting similarity between spin glasses and neural networks: an electron’s spin responds to the behavior of its neighbors much like a neuron does. In the electron’s case, it flips up if the weighted sum of the neighbors exceeds a threshold and flips (or
stays) down otherwise. Inspired by this, he defined a type of neural network that evolves over time in the same way that a spin glass does and postulated that the network’s minimum energy states are its memories. Each such state has a "basin of attraction" of initial states that converge to it, and in this way the network can do pattern recognition: for example, if one of the memories is the pattern of black-and-white pixels formed by the digit nine and the network sees a distorted nine, it will converge to the "ideal" one and thereby recognize it. Suddenly, a vast body of physical theory was applicable to machine learning, and a flood of statistical physicists poured into the field, helping it break out of the local minimum it had been stuck in.

The author goes on to describe the "sigmoid" function and its ubiquitous nature. If you think about the curve for some time, you will find it everywhere. I think the first time I came across this function was in Charles Handy's book, The Age of Paradox. Sigmoid functions in that book are used to describe phenomena that start with a slow rate of increase, then show a sudden explosive rate of increase, and finally taper off. Basically, if you take the first derivative of the sigmoid function, you get the classic bell curve. I think The Age of Paradox had a chapter with some heavy management gyan that went something like "you need to create another sigmoid curve in your life before the older sigmoid curve starts its downfall", or something to that effect. I don't quite recollect the exact idea from Charles Handy's book, but there is a blog post by Bret Simmons, titled The Road to Davy's Bar, that goes into related details.

In the context of ML, the application of the sigmoid curve is more practical. It can be used to replace the step function, and suddenly things become more tractable. A single neuron can only learn a straight line, but a set of neurons, i.e. a multi-layer perceptron, can learn far more convoluted boundaries. Agreed, there is a curse of dimensionality here, but if you think about it, the hyperspace explosion is a double-edged sword: on the one hand, the objective function is far more wiggly, but on the other hand, there is less chance that you will get stuck in a local minimum via gradient search methods. With the sigmoid activation and the multi-layer tweak, the perceptron came back with a vengeance. There was a ton of excitement, just like the time when the perceptron was first introduced. The algorithm by which the learning takes place is called "backpropagation", popularized by David Rumelhart and his colleagues in 1986. It is a variant of gradient descent, and there is no mathematical proof that backpropagation will find the global minimum. Backprop solves what the author calls the "credit assignment" problem: in a multi-layered perceptron, the error between the target value and the current output needs to be propagated backward across all the layers, and backprop is the recipe for assigning each layer its share of the blame.

Whenever the learner’s "retina" sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two.
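
To make the mechanics concrete, here is a tiny backprop sketch of my own (not the author's code): one hidden layer of three sigmoid units learning XOR, the very function the lone perceptron above could not handle. The network size, learning rate and iteration count are arbitrary choices for illustration.

    sigmoid <- function(z) 1 / (1 + exp(-z))

    X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
    y <- c(0, 1, 1, 0)                                 # XOR

    set.seed(42)
    W1 <- matrix(rnorm(2 * 3), 2, 3); b1 <- rnorm(3)   # input layer -> 3 hidden units
    W2 <- rnorm(3);                   b2 <- rnorm(1)   # hidden units -> output
    eta <- 0.5

    for (step in 1:20000) {
      H <- sigmoid(sweep(X %*% W1, 2, b1, "+"))        # forward pass: hidden activations
      o <- as.vector(sigmoid(H %*% W2 + b2))           # forward pass: outputs
      d_o <- (o - y) * o * (1 - o)                     # output error signal
      d_H <- (d_o %o% W2) * H * (1 - H)                # error propagated back to the hidden layer
      W2 <- W2 - eta * as.vector(t(H) %*% d_o); b2 <- b2 - eta * sum(d_o)
      W1 <- W1 - eta * t(X) %*% d_H;            b1 <- b1 - eta * colSums(d_H)
    }
    round(o, 2)   # usually close to 0 1 1 0; as noted above, there is no guarantee,
                  # since gradient descent can occasionally get stuck in a poor local minimum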

Sadly, the excitement petered out, as learning networks with many hidden layers was computationally too difficult at the time. In recent years, though, backpropagation has made a comeback thanks to huge computing power and big data; it now goes by the name "deep learning". The key idea of deep learning is based on autoencoders, which the author explains very well. However, many things still need to be worked out for deep learning to get anywhere close to the Master Algorithm. All said and done, there are a few limitations to following the connectionist tribe exclusively. Firstly, the learned model is difficult to comprehend: it comprises convoluted connections between various neurons. The other limitation is that the approach is not compositional, meaning it is divorced from the way a big part of human cognition works.

 

Evolution : Nature’s Learning Algorithm

The chapter starts with the story of John Holland, the first person to earn a PhD in computer science (in 1959). Holland is known for his immense contribution to genetic algorithms. His key insight lay in coming up with a fitness function that would assign a score to every program considered. What's the role of the fitness function? Starting with a population of not-very-fit individuals, possibly completely random ones, the genetic algorithm has to come up with variations that can then be selected according to fitness. How does nature do that? This is where the genetic part of the algorithm comes in. In the same way that DNA encodes an organism as a sequence of base pairs, we can encode a program as a string of bits, and variations are produced by crossover and mutation; a bare-bones sketch of the whole loop follows. The next breakthrough in the field came from Holland's student John Koza, who came up with the idea of evolving full-blown computer programs, i.e. genetic programming.
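
The sketch below is my own toy illustration, not anything from the book: bitstrings evolve toward a deliberately trivial fitness function (the number of 1s in the string), but the select / cross over / mutate loop is the same shape Holland described.

    set.seed(1)
    n_bits <- 20; pop_size <- 50; generations <- 100; p_mut <- 0.01

    fitness <- function(bits) sum(bits)    # toy fitness: count the 1s

    pop <- matrix(sample(0:1, pop_size * n_bits, replace = TRUE), nrow = pop_size)

    for (g in 1:generations) {
      f <- apply(pop, 1, fitness)
      # selection: parents drawn with probability proportional to fitness
      parents <- pop[sample(pop_size, 2 * pop_size, replace = TRUE, prob = f + 1), ]
      # crossover: splice the front of one parent onto the back of the other
      cut <- sample(1:(n_bits - 1), pop_size, replace = TRUE)
      child <- sapply(1:pop_size, function(i)
        c(parents[2 * i - 1, 1:cut[i]], parents[2 * i, (cut[i] + 1):n_bits]))
      pop <- t(child)
      # mutation: flip each bit with a small probability
      flips <- matrix(runif(length(pop)) < p_mut, nrow = pop_size)
      pop[flips] <- 1 - pop[flips]
    }
    max(apply(pop, 1, fitness))   # typically at or very close to n_bits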

Genetic programming’s first success, in 1995, was in designing electronic circuits. Starting with a pile of electronic components such as transistors, resistors, and capacitors, Koza’s system reinvented a previously patented design for a low-pass filter, a circuit that can be used for things like enhancing the bass on a dance-music track. Since then he’s made a sport of reinventing patented devices, turning them out by the dozen. The next milestone came in 2005, when the US Patent and Trademark Office awarded a patent to a genetically designed factory optimization system. If the Turing test had been to fool a patent examiner instead of a conversationalist, then January 25, 2005, would have been a date for the history books. Koza’s confidence stands out even in a field not known for its shrinking violets. He sees genetic programming as an invention machine, a silicon Edison for the twenty-first century.

A great mystery in genetic programming that is yet to be solved conclusively is the role of crossover. None of Holland's theoretical results show that crossover actually helps; mutation alone suffices to exponentially increase the frequency of the fittest schemas in the population over time. There were other problems with genetic programming as well, which finally made the ML community at large distance itself from this tribe.

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy. As in the nature versus nurture debate, neither side has the whole answer; the key is figuring out how to combine the two. The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. So, is this it ? Have we stumbled on to the right path for "Master Algorithm" ? Not quite. There are tons of problems with evolutionary algos. Symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do. If we want to learn to diagnose cancer, for example, it’s not enough to say "this is how nature learns; let’s do the same." There’s too much at stake. Errors cost lives. Symbolists dominated the first few decades of cognitive psychology. In the 1980s and 1990s, connectionists held sway, but now Bayesians are on the rise.

 

In the Church of the Reverend Bayes

Persi Diaconis, in his paper titled The Markov Chain Monte Carlo Revolution, says that the MCMC technique that came from the Bayesian tribe has revolutionized applied mathematics. Indeed, thanks to high performance computing, Bayes is now a standard tool in any number cruncher's tool kit. This chapter talks about various types of Bayesian techniques. The basic idea behind Bayes is that it is a systematic and quantified way of updating degrees of belief in the light of new data. You can cast pretty much any problem, irrespective of the size of the data available, as a Bayesian inference problem. Bayes' theorem often goes by the name "inverse probability" because in real life we know Pr(effect|cause) and we are looking to compute Pr(cause|effect). Bayes' theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy. The main point of conflict between Bayesians and non-Bayesians is the reliance on subjective priors to "turn the Bayesian crank". Using subjective estimates as probabilities is considered a sin by frequentists, for whom everything should be learned from the data.

One of the most common variants of Bayesian models is the "Naive Bayes" model, where each cause is assumed to act independently of the other causes in creating an effect. Even though this assumption sounds extremely crazy, there are a ton of areas where Naive Bayes beats sophisticated models. No one is sure who invented the Naive Bayes algorithm. It was mentioned without attribution in a 1973 pattern recognition textbook, but it only took off in the 1990s, when researchers noticed that, surprisingly, it was often more accurate than much more sophisticated learners. Also, if you reflect a bit, you will realize that Naive Bayes is closely related to the perceptron algorithm.
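
To make the "naive" independence assumption concrete, here is a toy spam classifier of my own; the word counts and the prior below are made up purely for illustration.

    train <- data.frame(
      word    = c("free", "viagra", "meeting", "report"),
      spam_ct = c(300, 200, 10, 5),      # times the word appears in spam
      ham_ct  = c(20, 1, 150, 120)       # times the word appears in non-spam
    )
    p_spam <- 0.4                         # assumed prior probability of spam

    # P(word | class), with add-one smoothing so unseen words don't zero things out
    p_w_spam <- (train$spam_ct + 1) / (sum(train$spam_ct) + nrow(train))
    p_w_ham  <- (train$ham_ct  + 1) / (sum(train$ham_ct)  + nrow(train))
    names(p_w_spam) <- names(p_w_ham) <- train$word

    classify <- function(words) {
      # the "naive" step: multiply per-word likelihoods as if they were independent
      log_spam <- log(p_spam)     + sum(log(p_w_spam[words]))
      log_ham  <- log(1 - p_spam) + sum(log(p_w_ham[words]))
      ifelse(log_spam > log_ham, "spam", "ham")
    }

    classify(c("free", "viagra"))      # "spam"
    classify(c("meeting", "report"))   # "ham"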

The author mentions Markov models as the next step in the evolution of Bayesian models. Markov models apply to a sequence of random variables in which each variable depends only on the current state and is conditionally independent of the rest of its history. Markov chains turn up everywhere and are one of the most intensively studied topics in mathematics, but they're still a very limited kind of probabilistic model. A more complicated model is the Hidden Markov Model, where we don't get to see the actual states but have to infer them from the observations. A continuous version of the HMM goes under the name "Kalman filter" and has been used in many applications across domains.

Naive Bayes, Markov models and hidden Markov models are all good, but they are a far cry from what the Symbolists can represent. The next breakthrough came from Judea Pearl, who invented Bayesian networks. These allow one to specify complex dependencies among random variables: by defining the conditional independence of a variable given a set of neighboring nodes, Bayesian networks tame the combinatorial explosion and make inference tractable. Basically, a Bayesian network can be thought of as a "generative model", a recipe for probabilistically generating a state of the world. Despite the complex nature of a Bayesian net, the author mentions that techniques have been developed to successfully infer various aspects of the network. In this context, he introduces MCMC and gives an intuitive explanation of the technique. A common misconception is that MCMC is a simulation technique; far from it, the procedure does not simulate any real process, rather it is an efficient way to generate samples from a Bayesian network. Inference in Bayesian networks is not limited to computing probabilities; it also includes finding the most probable explanation for the evidence. The author uses the "poster child" example of inferring the probability of heads from coin tosses to illustrate the Bayesian technique and compare it with the frequentist world of inference.
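
As a small, self-contained taste of MCMC (my own illustration of the coin-toss example, not the author's), here is a Metropolis sampler drawing from the posterior of the heads probability after 7 heads in 10 tosses, under a flat prior.

    set.seed(7)
    heads <- 7; tosses <- 10
    log_post <- function(p) {
      if (p <= 0 || p >= 1) return(-Inf)
      dbinom(heads, tosses, p, log = TRUE)     # flat prior, so posterior is proportional to the likelihood
    }

    n_iter <- 50000
    draws  <- numeric(n_iter)
    p_cur  <- 0.5
    for (i in 1:n_iter) {
      p_prop <- p_cur + rnorm(1, sd = 0.1)                 # propose a small move
      if (log(runif(1)) < log_post(p_prop) - log_post(p_cur))
        p_cur <- p_prop                                    # accept, otherwise stay put
      draws[i] <- p_cur
    }
    mean(draws)                    # close to the analytic posterior mean 8/12, i.e. about 0.67
    quantile(draws, c(.025, .975)) # a 95% credible interval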

The next set of models that came to dominate the Bayesian tribe is Markov Networks. A Markov network is a set of features and corresponding weights, which together define a probability distribution. Like Bayesian networks, Markov networks can be represented by graphs, but they have undirected arcs instead of arrows. Markov networks are a staple in many areas, such as computer vision. There are many who feel that Markov networks are far better than Naive Bayes, HMMs etc., as they can capture the influence from surroundings.

Bayesians and symbolists agree that prior assumptions are inevitable, but they differ in the kinds of prior knowledge they allow. For Bayesians, knowledge goes in the prior distribution over the structure and parameters of the model. In principle, the parameter prior could be anything we please, but ironically, Bayesians tend to choose uninformative priors (like assigning the same probability to all hypotheses) because they’re easier to compute with. For structure, Bayesian networks provide an intuitive way to incorporate knowledge: draw an arrow from A to B if you think that A directly causes B. But symbolists are much more flexible: you can provide as prior knowledge to your learner anything you can encode in logic, and practically anything can be encoded in logic-provided it’s black and white.

Clearly, we need both logic and probability. Curing cancer is a good example. A Bayesian network can model a single aspect of how cells function, like gene regulation or protein folding, but only logic can put all the pieces together into a coherent picture. On the other hand, logic can't deal with incomplete or noisy information, which is pervasive in experimental biology, but Bayesian networks can handle it with aplomb. Combining connectionism and evolutionism was fairly easy: just evolve the network structure and learn the parameters by backpropagation. But unifying logic and probability is a much harder problem.

 

You are what you resemble

The author introduces the techniques of the "Analogizers" tribe. This tribe uses similarities among various data points to categorize them into distinct classes. In some sense, we all learn by analogy: every example that illustrates an abstract concept is like an analogy. We learn by relating the similarity between two concepts and then figuring out what else we can infer from the fact that the two concepts are similar.

The chapter begins with explaining the most popular algorithm of the tribe, "the nearest neighbor algorithm". This was invented way back in 1951 by Evelyn Fix and Joe Hodges. The inventors faced a massive difficulty in publishing their algorithm. However the fact that the algo remained unpublished did not faze many researchers who went about developing variants of the algorithm like "K nearest neighbor" method, "Weighted K-nearest neighbor" etc. It was in 1967 that Tom Cover and Peter Hart proved that, given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier. This was a momentous revelation. Up until then, all known classifiers assumed that the frontier had a very specific form, typically a straight line. This was a double-edged sword: on the one hand, it made proofs of correctness possible, as in the case of the perceptron, but it also meant that the classifier was strictly limited in what it could learn. Nearest-neighbor was the first algorithm in history that could take advantage of unlimited amounts of data to learn arbitrarily complex concepts. No human being could hope to trace the frontiers it forms in hyperspace from millions of examples, but because of Cover and Hart’s proof, we know that they’re probably not far off the mark.

Is the nearest-neighbor algo the Master Algorithm? It isn't, because of the curse of dimensionality: as the number of dimensions goes up, the efficiency of nearest-neighbor goes down. In fact, the curse of dimensionality is the second most important stumbling block in machine learning, overfitting being the first. There are techniques to handle the dimension explosion, but most of them are hacks and there is no guarantee that they will work.
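
As a concrete illustration (mine, not the book's), the k-nearest-neighbor idea is a few lines of R with the class package:

    library(class)

    set.seed(3)
    idx   <- sample(nrow(iris), 100)           # split iris into train and test
    train <- iris[idx, 1:4];  test <- iris[-idx, 1:4]
    cl    <- iris$Species[idx]

    pred <- knn(train, test, cl, k = 5)        # vote among the 5 nearest neighbors
    mean(pred == iris$Species[-idx])           # accuracy, usually well above 0.9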

Subsequently, the author introduces Support Vector Machines (SVMs), which have become the most popular technique used by Analogizers. I loved the way the author describes this technique in plain, simple English: he asks the reader to visualize a fat serpent that moves between two countries that are at war. The story of finding that serpent incorporates pretty much all the math that is needed to understand support vectors, i.e.

  • kernel for SVM
  • support vectors
  • weight of the support vectors
  • constrained optimization
  • maximizing the margin of the classifier

My guess is one would understand the math far more easily after reading this section on SVMs. SVMs have many advantages and the author highlights most of them. Books such as these also help us verbalize math in simple words. For example, if you were to explain the difference between constrained and unconstrained optimization to a taxi driver, how would you do it? Read this book to check whether your explanation is better than the author's.
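
For readers who want to see the technique in action, here is a minimal SVM fit of my own using the e1071 package (the kernel and cost settings are arbitrary choices for illustration):

    library(e1071)

    fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
    fit$tot.nSV                              # how many training points ended up as support vectors
    mean(predict(fit, iris) == iris$Species) # training accuracy of the fitted classifier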

Towards the end of the chapter, the author talks about case-based reasoning and says that in the years to come, analogical reasoning will become so powerful that it will sweep through all the fields where case-based reasoning is still employed.

 

Learning without a teacher

Unlike the previous chapters, which focused on labeled data, this chapter is about learning without labels, i.e. unsupervised learning. Cognitive scientists describe theories of child learning using algorithms, and machine learning researchers have developed techniques based on them. The author explains the k-means algorithm, a popular clustering technique. It is actually a special case of the Expectation Maximization (EM) algorithm, which was formalized by three Harvard statisticians. EM is used in a ton of places: to learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. Once you have clusters at the macro level, nothing stops you from running the same algo within each cluster to come up with sub-clusters, and so on.
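
A tiny k-means illustration of my own, using the built-in iris data: cluster the measurements into three groups and see how well the unlabeled clusters line up with the actual species.

    set.seed(10)
    km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
    table(km$cluster, iris$Species)   # clusters recover the species reasonably well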

Subsequently, the author introduces another popular technique for unsupervised learning, PCA, which is used for dimensionality reduction. PCA comes up with linear combinations of the original dimensions in such a way that a few of these combinations capture as much of the variance in the data as possible. A step up from this algo is "Isomap", a nonlinear dimensionality reduction technique. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network, and finds the reduced coordinates that best approximate these distances.
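
And a correspondingly tiny PCA sketch (again my own example): for the iris measurements, the first two principal components already capture most of the variance.

    pc <- prcomp(iris[, 1:4], scale. = TRUE)
    summary(pc)$importance["Cumulative Proportion", 1:2]   # roughly 0.96 with just two components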

After introducing clustering and dimensional reduction techniques, the author talks about "Reinforcement learning", a technique that relies on immediate response of the environment for various actions of the learner. Research on reinforcement learning started in earnest in the early 1980s, with the work of Rich Sutton and Andy Barto at the University of Massachusetts. They felt that learning depends crucially on interacting with the environment, but supervised algorithms didn’t capture this, and they found inspiration instead in the psychology of animal learning. Sutton went on to become the leading proponent of reinforcement learning. Another key step happened in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children’s learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup that combines neural networks and reinforcement learning is "DeepMind", a company that was acquired by Google for half a billion dollars.
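
To give a feel for Watkins-style Q-learning, here is a toy sketch of my own (the corridor world, learning rate and discount factor are all made up for illustration): an agent on a five-cell corridor learns that walking right toward the reward in cell 5 pays off.

    set.seed(4)
    n_states <- 5; actions <- c(-1, 1)            # step left or step right
    Q <- matrix(0, n_states, length(actions))
    alpha <- 0.1; gamma <- 0.9; eps <- 0.1

    for (episode in 1:500) {
      s <- 1
      while (s != n_states) {
        a  <- if (runif(1) < eps) sample(2, 1) else which.max(Q[s, ])   # explore vs exploit
        s2 <- min(max(s + actions[a], 1), n_states)
        r  <- if (s2 == n_states) 1 else 0
        # the Q-learning update: nudge Q toward reward plus discounted best future value
        Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s2, ]) - Q[s, a])
        s <- s2
      }
    }
    apply(Q[-n_states, ], 1, which.max)   # learned policy for the non-terminal cells: always "go right"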

Another algorithm that has the potential to be part of the "Master Algorithm" is chunking. Chunking remains a preeminent example of a learning algorithm inspired by psychology, and the author gives a basic outline of the concept. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction, but a simpler type of learning by interacting with the environment is: A/B testing. The chapter ends with the author explaining another potentially killer algo, "relational learning".

 

The Pieces of the Puzzle Fall into Place

Progress in science comes from unifying theories: two or more seemingly disparate observations turn out to be driven by the same logic or law. If one looks at ML, the Master Algorithm is akin to such a unifying theory. It would unify the master algorithms and techniques of all the tribes and give one cohesive way to learn from data.

In fact, there is already a class of techniques called "meta-learning" that some of the tribes use within their own methods. For example, bagging, random forests and boosting are some of the famous meta-learning techniques used by Symbolists. Bayesians have something called "model averaging", which treats each model as a hypothesis and weights its vote by the evidence in its favor. Meta-learning in its current avatar is remarkably successful, but it's not a very deep way to combine models. It's also expensive, requiring as it does many runs of learning, and the combined models can be quite opaque.

The author uses the following schematic diagram for each of the tribes while explaining the rationale for a possible "Master Algorithm":

[Image: puzzle diagram]

He then takes the reader through a tour of each tribe's philosophy and master algorithm and comes up with a unifier, called "Alchemy", which he presents as the "Master Algorithm". In the process of creating it, he introduces Markov Logic Networks and says that they serve as the representation. Alchemy uses posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. The author is wary about Alchemy's immediate applicability and says that a ton of research is yet to be done before it can become a true Master Algorithm, i.e. one that has the capability to solve hard problems.

This chapter is a little more involved, as it tries to connect all the ideas from the previous eight chapters and introduces a way to combine the various pieces of the puzzle to create a "Master Algorithm". The chapter will also be very interesting for an aspiring ML researcher who is trying to pick a focus area.

 

This is the World on Machine Learning

The last chapter of the book discusses a world in which the "Master Algorithm" is all-pervasive. The author speculates on answers to the following questions:

  • Will humans be replaced by machines ?
  • What do you want the learners of the world to know about you ?
  • How good a model of you, a learner, can have ?
  • How will ecommerce shopping experience be ?
  • Will there be a rise of "humanities" disciplines after the automation of most non-human-related tasks ?
  • What will the current online dating sites morph in to ?
  • How will the Amazons, Netflixes, Googles of the world change ?
  • What will be the privacy issues in a society where most of the transactions and activities involve, one algo talking to another algo?
  • Will future wars be fought by robots ?
  • Will robot-warfare be viable ?
  • Will AI and Master Algorithm take over the world ?

The author ends the book by saying,

Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change.

Takeaway:

This book is a pop-science book on machine learning. ML has reached a point where it is not just for geeks anymore: everyone needs to know about it, or at least have a conceptual model of the field, as it has become all-pervasive. Having said that, it would take time to plod through the book if you are a complete newbie to ML. This book is massively appealing to someone who has a cursory knowledge of a few ML techniques and wants a 10,000 ft. view of the entire field's past, present and future. The future, as the title of the book suggests, is the invention of a master algorithm that unifies methods across all five tribes.

Reproducible Research with R and RStudio

The book starts by explaining an example project that one can download from the author's github account. The project files serve as an introduction to reproducible research. I guess it might make sense to download this project, try to follow the instructions and create the relevant files. By compiling the example project, one gets a sense of what one can accomplish by reading through the book.

Introducing Reproducible Research
The highlight of an RR document is that data, analysis and results all live in one document. There is no separation between announcing the results and doing the number crunching. The author gives a list of benefits that accrue to any researcher generating RR documents. They are:

  • better work habits
  • better team work
  • changes are easier
  • high research impact

The author uses knitr / rmarkdown in the book to discuss reproducibility. The primary difference between the two is that the former demands that the document be written in the markup language associated with the desired output, whereas the latter lets a single markup produce a variety of output formats.

Getting Started with Reproducible Research
The thing to keep in mind is that reproducibility is not an afterthought – it is something you build into the project from the beginning. Some general aspects of RR are discussed by the author. If you are not yet convinced of the benefits of RR, read this chapter carefully; it also gives some RR tips to a newbie. This chapter also gives the reader a road map of what to expect from the book. In any research project, there is a data gathering stage, a data analysis stage and a presentation stage. The book contains a set of chapters addressing each stage of the project and, more importantly, ways to tie the stages together so as to produce a single compendium for your entire project.

Getting started with R, RStudio and knitr/rmarkdown
This chapter gives a basic introduction to R and subsequently dives into knitr and rmarkdown commands. It shows how one can create a .Rnw or .Rtex document and convert it into a PDF, either through RStudio or from the command line. rmarkdown documents, on the other hand, are more convenient for reproducing simple projects where there are not many interdependencies between the various tasks. Obviously the content in this chapter gives only a general idea; one has to dig through the documentation to make things work. One learning for me from this chapter is the option of creating .Rtex documents, in which the syntax can be less baroque.

Getting started with File Management
This chapter gives a basic directory structure that one can follow for organizing project files, and one can use it as a guideline for one's own projects. The example project uses a GNU make file for data munging. The chapter also gives a crash course in bash.

Storing, Collaborating, Accessing Files, and Versioning
The four activities mentioned in the chapter title can be done in many ways. The chapter focuses on Dropbox and Github. It is fairly easy to learn to use the limited functionality one gets from Dropbox. On the other hand, Github demands some learning from a newbie. One needs to get to know the basic terminology of git. The author does a commendable job of highlighting the main aspects of git version control and its tight integration with RStudio.

Gathering Data with R

This chapter talks about the way in which one can use the GNU make utility to create a systematic way of gathering data. The use of a make file makes it easy for others to reproduce the data preparation stage of a project. If you have written a make file for C++ or in some other context, it is pretty easy to follow the basic steps mentioned in the chapter; otherwise it might involve some learning curve. My guess is that once you start writing make files for specific tasks, you will realize their tremendous value in any data analysis project. A nice starting point for learning about make files is robjhyndman's site.

Preparing Data for Analysis
This chapter gives a whirlwind tour of data munging operations and data analysis in R.

Statistical Modeling and knitr
The chapter gives a brief description of the chunk options that are frequently used in an RR document. Of all the options, cache.extra and dependson are ones I have never used in the past, so they were a learning for me. One of the reasons I like knitr is its ability to cache objects. In the Sweave era, I had to load separate packages and do all sorts of things to run a time-intensive RR document; it was very painful, to say the least. Thanks to knitr, it is extremely easy now. Even though the cache option is described at the end of the chapter, I think it is one of the package's most useful features. Another good thing is that you can combine various languages in an RR document (a small sketch of these options follows the list below). Currently knitr supports the following language engines:

  • Awk
  • Bash shell
  • CoffeeScript
  • Gawk
  • Haskell
  • Highlight
  • Python
  • R (default)
  • Ruby
  • SAS
  • Bourne shell
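
Purely as an illustration (the chunk labels and code below are made up), here is roughly what the caching, dependson and engine options look like in a knitr .Rnw document; the same options work in .Rmd chunks.

    <<expensive-model, cache=TRUE>>=
    fit <- lm(mpg ~ wt + hp, data = mtcars)   # re-run only when this chunk's code changes
    @

    <<model-table, dependson="expensive-model">>=
    summary(fit)$coefficients                 # invalidated whenever the cached chunk above changes
    @

    <<line-counts, engine="bash">>=
    wc -l *.csv                               # one of the non-R engines listed above
    @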


Showing results with tables
In whatever analysis you do using R, there are always situations where your output is a data.frame, a matrix or some sort of list structure that needs to be formatted for display as a table. One can use kable to show data.frame and matrix structures; it is simple and effective, but limited in scope. The xtable package, on the other hand, is extremely powerful: one can pass various fitted statistical model objects to the xtable function to obtain table and tabular environments encoding the results. The chapter also mentions texreg, which is more powerful still; with texreg, you can show the output of more than one statistical model in a single table in your RR document. There are times when an output class is not supported by xtable; in such cases, one has to manually hunt down the relevant values, create a data frame or matrix of the results and then call xtable on that.
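
Purely as an illustration (my own example, not the author's), the three routes look roughly like this:

    fit <- lm(mpg ~ wt + hp, data = mtcars)

    knitr::kable(head(mtcars[, 1:4]))      # quick table from a data frame or matrix

    library(xtable)
    print(xtable(fit))                     # LaTeX table/tabular code for the model fit

    library(texreg)
    texreg(list(fit, update(fit, . ~ . + qsec)))   # several models side by side in one table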

Showing results with figures
It is often better to know basic LaTeX syntax for embedding graphics before using knitr. One problem I have always faced with knitr-embedded graphics is that all the chunk options must be given on one single line; you cannot spread chunk options over two lines. I learnt a nice hack from this chapter whereby some of the environment-level code can be written as markup rather than as chunk options. This chapter touches upon the main chunk options relating to graphics and does it well, without overwhelming the reader.

Presentation with knitr/LaTeX
The author says that much of the LaTeX in the book was written using the Sublime Text editor. I think this is the case with most people who intend to create an RR document; even though RStudio has a good environment for creating a LaTeX file, I usually go back to my old editor to write LaTeX markup. How to cite a bibliography in your document, and how to cite R packages, are questions that every researcher has to think about when producing RR documents, and the author does a good job of highlighting the main aspects of this thought process. The chapter ends with a brief discussion of Beamer; it gives a 10,000 ft. view of it. I stumbled on to a nice link in this chapter that explains the reason for using fragile frames in beamer.

Large knitr/LaTeX Documents: Theses, Books, and Batch Reports
This chapter is extremely useful for creating long RR documents. In fact, if your RR document is large, it makes sense to logically subdivide it into separate child documents. In knitr, there are chunk options to specify parent and child relationships; these make it possible to knit child documents independently of the other documents embedded in the parent, and you do not have to repeat the preamble code in each child document as it is inherited from the parent. The author also shows a way to use Pandoc to convert an rmarkdown document to tex, which can then be included in the RR document.

The penultimate chapter is on rmarkdown. The concluding chapter of the book discusses some general issues of reproducible research.

Takeaway:

This book gives a nice overview of the main tools that one can use in creating an RR document. Even though the title of the book has the term "RStudio" in it, the tools and the hacks mentioned are IDE agnostic. One could read a book-length treatment of each of the tools mentioned here and easily get lost in the details. Books such as these give a nice overview of all the tools and hence motivate the reader to dive into the specifics as and when the need arises.


This book can be used as a companion to a more pedagogical text on survival analysis. For someone looking for the appropriate R command for fitting a certain kind of survival model, this book is apt. It gives neither the intuition nor the math behind the various models; it reads like an elaborate help manual for the R packages related to event history analysis.

I guess one of the reasons for the author writing this book is to highlight his package eha on CRAN. The author's package is basically a layer on top of the survival package with some advanced techniques that, I guess, only a serious researcher in this field can appreciate. The book takes the reader through the entire gamut of models in a pretty dry format, i.e. it gives the basic form of a model, the R commands to fit it, and some commentary on how to interpret the output. The difficulty level is not a linear function from start to end; I found some very intricate material interspersed among some very elementary estimators. An abrupt discussion of Poisson regression breaks the flow in understanding the Cox model and its extensions, and the chapter on Cox regression contains a detailed and unnecessary discussion of some elementary aspects of any regression framework. Keeping these cribs aside, the book is useful as a quick reference to functions from the survival, coxme, cmprsk and eha packages.


Takeaway:

“Write your code as though you are releasing it as a package” – This kind of thinking forces one to standardize directory structure, abandon adhoc scripts and instead code well thought out functions, and finally leverage the devtools functionality to write efficient, extensible and shareable code.


Takeaway:

The book is definitely overpriced for what it delivers. Given that 50% of this book explains R basics, the title is not at all appropriate. Even the quant stuff that is covered in the remaining 50% of the book is laughably inadequate.


In the last few decades, enormous computational speed has become accessible to many. A modern desktop has enough memory and processing speed to enable a data analyst to compute probabilities and perform statistical inference by writing computer programs. In such a context, this book can serve as a starting point for anyone who wishes to explore the subject of computational probability. The book has 21 puzzles that can be solved via simulation.

Solving a puzzle has its own advantages. Give a dataset with one dependent variable and a set of predictors to a dozen people, asking them to fit a regression model, and I bet you will see at least a dozen models, each of which could be argued to be plausible. Puzzles are different: there are constraints around the problem, so you are forced to find the ONE RIGHT solution. In doing so, you develop much more sophisticated thinking skills.

In the introductory chapter of the book, the author provides a basic framework for computational probability by showing ways to simulate and compute probabilities. This chapter gives the reader all the ammunition required to solve the various puzzles of the book. The author provides detailed solutions, including the relevant MATLAB code, to all 21 puzzles.
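
To give a flavor of this simulation-first approach, here is a toy example of my own (it is not one of the book's 21 puzzles): estimate the probability that at least two people in a room of 23 share a birthday, then compare with the exact answer.

    set.seed(123)
    n_sims <- 100000
    hits <- replicate(n_sims, any(duplicated(sample(365, 23, replace = TRUE))))
    mean(hits)                         # simulation estimate, about 0.507
    1 - prod((365:343) / 365)          # exact value, about 0.5073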

Some of my favorite puzzles from the book that are enlightening as well as paradoxical are :

  • The Gamow-Stern Elevator
  • The Pipe Smoker's Discovery
  • A Toilet Paper Dilemma
  • Parrondo's Paradox
  • How Long Is the Wait to Get the Potato Salad?
  • The Appeals Court Paradox

Here is the link to my document that fleshes out the details of all the 21 puzzles in the book:

What’s in the above document?

I have written R code that aims to computationally solve each of the puzzles in the book. For each puzzle, there are two subsections: the first spells out my attempt at solving the puzzle, and the second contains my learning from reading through the solution given by the author. The author provides extremely detailed MATLAB code, so detailed that anyone with absolutely no exposure to MATLAB can still follow the logic; in many cases I found that the code snippets in the book read like elaborate pseudocode. There are many good references for each of the puzzles so that interested readers can explore further. In most cases, the reader will realize that closed form solutions are extremely tedious to derive, and simulation based procedures make it easy to obtain solutions to many intractable problems.

There is a difference between an R user and an R programmer. The former is usually concerned with writing R scripts, using existing R libraries, in order to do data wrangling, model development, back testing or creating a reproducible research document. An R programmer, on the other hand, is usually interested in creating a package or reusable code that can be used by others in his company or by the R community.

The author of the book, Hadley Wickham, has been one of the biggest contributors of infrastructure and visualization packages in R. In this book, the author does a splendid job of explaining the nuts and bolts of R, the knowledge that he has gained from spending more than two decades building useful R packages.

Given the title, it is pretty obvious that this should not be someone's first R book. But I think it should be read as soon as you have a good enough grasp of R; delaying too much in reading this book might not be a good idea. Even if you intend to remain an R user rather than become an R programmer, it is worth your while to spend the time and effort to work through this book. If you intend to write your own packages, either for internal use or for the community, this book is priceless.

I found the content on "non-standard evaluation", i.e. chapters 13 & 14, to be extremely well organized. You would probably have to spend a lot of time on stackoverflow or reading others' code to get the kind of understanding you get from reading these chapters. Most of the initial chapters in the book start with a few teaser questions, and the author suggests that the reader can skip a chapter if she is comfortable with all of its questions. My guess is that even a seasoned R programmer will find some of these questions tricky or tough to answer, and will end up working through most of the initial chapters.

This is not a book that you can read over a weekend or even a fortnight. My guess is it will take at least a month or two to understand and reflect on the many gems scattered throughout the book. If you have used the shiny package for any UI development, reading chapter 15 gives a good idea of what happens behind shiny. If you want to write a C++ function to speed things up and call it from R, you will have to slog through chapter 19, which deals with this aspect comprehensively. There are two chapters on functional programming that explain the powerful features of R.

My favorite chapters in the book are the ones on non-standard evaluation. The thing about R is that it can be used both as an interactive language and as a functional programming language. So, whenever you write your code and want it to be extensible, you need to think about quite a lot of things; just read the source code of any base R function and you will see that it looks very different. There are many tradeoffs that an R developer has to make. What are they? Such questions, and many more about non-standard evaluation, are thoroughly explained in about 50 pages. There are also chapters on memory management and code optimization that will be of immense use to any R user or programmer.
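
For readers who haven't met non-standard evaluation before, here is a tiny taste (my own example, not the book's): an R function can capture the expression you typed instead of the value it evaluates to.

    f <- function(x) substitute(x)
    f(a + b)                        # returns the unevaluated expression a + b

    # This is how functions like subset() let you refer to columns without quoting them:
    subset(mtcars, mpg > 30)        # 'mpg > 30' is evaluated inside the data frame, not globally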

I think this book can serve as an excellent guide to anyone wanting to be a better R user / R developer. It has tons of useful references that you might want to go over, if you really want to understand the language at a deeper level.


This book is mainly targeted at R newbies who have managed to learn some basic R syntax and are looking for guidance on writing functions in R. The key idea one needs to understand before writing functions is the way R organizes objects in various environments. The author explains this idea using the analogy of the file system on a computer. The book uses two running examples: 1) shuffling a deck of cards and dealing from it, and 2) simulating the outcomes of a slot machine. The code is extensively annotated so as to make it easy for any reader to thoroughly understand each fragment. Even though the examples used in the book are very simple, the author manages to cover quite a number of key concepts of the R language.
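
A small illustration of the environment idea (my own, not from the book): a function carries a reference to the environment it was created in, much like a file lives inside a folder.

    counter <- function() {
      n <- 0
      function() {            # the returned function keeps living in counter()'s environment
        n <<- n + 1
        n
      }
    }
    tick <- counter()
    tick(); tick()            # 1, then 2: the state survives between calls
    environment(tick)         # the enclosing environment created by counter(), not the global one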

Here is a suggestion to someone who thinks he/she is too good at R programming: code up your own slot machine based on the description given in chapter 7, check the time your code takes to simulate 10 million slot machine payoffs, and see if it does better than the author's code, which takes just 23 seconds.

For me, the takeaway from this book is the following visual cue, useful for remembering how R environments are organized:

[Image: diagram of R environments]


It was bootstrapping that made me start off on my statistics journey years ago. I have very fond memories of the days when I could understand simple things in statistics without resorting to complicated looking formulae; a few lines of code were all that was needed. Slowly I became more curious about many things in statistics, and that's how my love affair with stats began. There are two bibles that any newbie to the bootstrap should go over: one by Efron & Tibshirani and the other by Davison & Hinkley. Any other specific topic you can always understand by reading papers. It is always a nice feeling for me to read about bootstrapping. However, reading this book was an extremely unpleasant experience.

In recent years, with the rise of R, many authors have started writing books such as "Introduction to ____ (fill in any statistical technique you want) using R". With many more people adopting R, these books hope to fill the need of a data analyst who might not be willing to immerse himself/herself in the deep theory behind a technique. The target audience might just want a package that can crunch out numbers. Fair enough; not everyone has the time and inclination to know the details. There are some amazing books that fill this need and do it really well. Sadly, this book is not in that category. It neither explains the key functions for bootstrapping nor explains the code sprinkled through the book, so the R in the title is definitely misleading. Instead of talking about the nuances of the various functions based on the authors' experience, all one gets to see is some spaghetti code. I can't imagine authors using 15 pages of the book (that too within a chapter and not in an appendix) to list various packages that have some kind of bootstrap function, yet that's exactly what the authors of this book have done. Insane! The book gives a historical perspective on various developments around bootstrapping techniques, but you can't learn anything specific from it; it just gives a 10,000 ft. overview of various aspects of bootstrapping. I seriously do not understand why the authors have even written this book. My only purpose in writing this review is to dissuade others from reading this book and wasting their time and money.

Introduction

The bootstrap is one of a number of techniques that fall under the broad umbrella of nonparametric statistics, commonly called resampling methods. It was the article by Brad Efron in 1979 that started it all. The impact of this important publication can be gauged by the following statement in Davison and Hinkley's book:

The idea of replacing complicated and often inaccurate approximations to biases, variances and other measures of uncertainty by computer simulation caught the imagination of both theoretical researchers and users of statistical methods

Efron's motivation was to construct a simple approximation to the jackknife procedure that was initially developed by John Tukey. Permutation methods had been known since the 1930s, but they were ineffective beyond small samples. Efron connected bootstrapping to the then available jackknife, delta method, cross-validation and permutation tests, and he was the first to show that the bootstrap was a real competitor to the jackknife and the delta method for estimating the standard error of an estimator. Throughout the 1980s and 1990s, there was an explosion of papers on the subject: the bootstrap was being used for confidence intervals, hypothesis testing and more complex problems. In 1983, Efron wrote a remarkable paper showing that the bootstrap worked better than cross-validation in classification problems of a certain kind. While these positive developments were happening, by the 1990s there were also papers showing that bootstrap estimates were not consistent in specific settings; the first published example of an inconsistent bootstrap estimate appeared in 1981. By the year 2000, there were quite a few articles showing that bootstrapping can be a great tool for estimating various quantities but can also be inconsistent. After this brief history, the chapter goes on to define some basic terms and explain four popular methods: the jackknife, the delta method, cross-validation and subsampling. Of all the packages mentioned in the chapter (which take up 15 pages), I think all one needs to tinker with to understand the basic principles are the boot and bootstrap packages.
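
For a newbie who just wants to see the basic mechanics, here is a minimal example of my own using the boot package and the built-in rivers data:

    library(boot)
    b <- boot(data = rivers,
              statistic = function(d, i) mean(d[i]),   # the statistic, recomputed on each resample
              R = 2000)
    b           # prints the original estimate, the bootstrap bias and the standard error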

Estimation

This chapter talks about improving point estimation via bootstrapping. Historically speaking, the bootstrap was first looked at as a way to estimate the standard error of an estimate, and later for bias adjustment. The chapter begins with a simple example where the bootstrap is used to compute the bias of an estimator. Subsequently, a detailed set of examples of using bootstrapping to improve cross-validation estimates is given; these show that there are many instances where the bootstrapped cross-validation technique performs better than estimators like CV, 632 and e0. For estimating a location parameter of a random variable from a particular distribution, MLE does a great job and hence one need not really use bootstrapping; however, there are cases where MLE estimates have no closed form solution, and in all such cases one can just bootstrap away to glory. In the case of linear regression, there are two ways in which bootstrapping can be used. The first method involves residuals: bootstrap the residuals, create a set of new dependent variables and use them to form a bootstrapped sample of regression coefficients. The second method is bootstrapping pairs: sample pairs of the dependent and independent variables and compute the regression coefficients. Between the two, the second method is found to be more robust to model misspecification.
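
To make the "bootstrapping pairs" idea concrete, here is a small sketch of my own (using the built-in mtcars data, not one of the book's examples): resample whole (y, x) rows, refit, and read off the spread of the coefficients.

    set.seed(99)
    B <- 2000
    coefs <- replicate(B, {
      idx <- sample(nrow(mtcars), replace = TRUE)   # resample rows, i.e. (y, x) pairs
      coef(lm(mpg ~ wt, data = mtcars[idx, ]))
    })
    apply(coefs, 1, sd)        # bootstrap standard errors of the intercept and slope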

Some of the other uses of bootstrapping mentioned in the chapter are:

  • Dealing with heteroskedastic errors by using wild bootstrap
  • Nonlinear regression
  • Non parametric regression
  • Application to CART family (bagging, boosting and random forests)

My crib about this chapter is this: you are introducing data mining techniques like LDA, QDA, bagging etc. in a chapter where the reader is supposed to get an intuition about how bootstrapping can be used to get a point estimate. Who is the target audience for this book? Someone who is already familiar with these data mining techniques would gloss over the material, as there is nothing new for him; a newbie would be overwhelmed by it; and for someone who is neither a newbie nor a data mining person, the content will appear totally random. An extremely poor choice of content for an introductory book.

Confidence Intervals

One of the advantages of generating bootstrapped samples is that they can be used to construct confidence intervals, and there are many ways to create them. The chapter discusses the bootstrap-t, the iterated bootstrap, BC, BCa and the tilted bootstrap. Again, I don't expect any newbie to clearly understand these methods after reading this chapter; all the authors have managed to do is give a laundry list of methods with some description of each, plus an extensive set of references that makes you feel you are reading a paper rather than a book. If you really want to understand these methods, the bibles mentioned at the beginning of this review are the right sources.
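
For what it's worth, here is the kind of minimal example I wish the book had included (my own, using the boot package): several flavors of interval from the same set of resamples.

    library(boot)
    b <- boot(rivers, function(d, i) mean(d[i]), R = 2000)
    boot.ci(b, type = c("basic", "perc", "bca"))   # basic, percentile and BCa intervals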

Hypothesis testing

For simple problems, hypothesis testing can be done based on the confidence intervals obtained via bootstrap samples. There are subtle aspects that one needs to take care of, such as resampling from the pooled data under the null hypothesis. It is amazing that the authors don't even provide sample code to illustrate this point: the code that is provided resamples from the individual samples, whereas code should have been provided to illustrate sampling from the pooled data (see the sketch below). Again, a poor choice in the way the content is presented.
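
Here is the kind of pooled-resampling sketch I had in mind (my own code, with made-up data): a bootstrap two-sample test of equal means in which both groups are resampled from the pooled data, which is how the null hypothesis gets imposed.

    set.seed(2024)
    x <- rnorm(30, mean = 0); y <- rnorm(40, mean = 0.5)   # made-up data
    obs <- mean(x) - mean(y)

    pooled <- c(x, y)
    B <- 5000
    null_diffs <- replicate(B, {
      xs <- sample(pooled, length(x), replace = TRUE)      # both groups drawn from the pool
      ys <- sample(pooled, length(y), replace = TRUE)
      mean(xs) - mean(ys)
    })
    mean(abs(null_diffs) >= abs(obs))   # bootstrap p-value for the two-sided test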

Time Series

The chapter gives a laundry list of bootstrap procedures in the context of time series: the model based bootstrap, the non-overlapping block bootstrap, the circular block bootstrap, the stationary block bootstrap, the tapered block bootstrap, the dependent wild bootstrap and the sieve bootstrap. Again, a very cursory treatment, with references to a whole lot of papers and books. The authors get it completely wrong: in an introductory book there must be R code and simple examples to illustrate the point. If, instead, the reader sees a lot of references to papers and journal articles, he is going to junk the book and move on.
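
For the curious, here is a hand-rolled moving-block bootstrap sketch of my own (the boot package's tsboot function packages the same idea): resampling blocks instead of single points preserves the serial dependence.

    set.seed(5)
    x <- arima.sim(list(ar = 0.6), n = 200)        # an autocorrelated series
    l <- 10                                        # block length
    block_means <- replicate(2000, {
      starts <- sample(length(x) - l + 1, ceiling(length(x) / l), replace = TRUE)
      xb <- unlist(lapply(starts, function(s) x[s:(s + l - 1)]))
      mean(xb[1:length(x)])                        # trim in case the blocks overshoot the series length
    })
    sd(block_means)      # std. error of the mean that respects the dependence structure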

Bootstrap variants

The same painful saga continues. The chapter gives a list of techniques – the Bayesian bootstrap, the smoothed bootstrap, the parametric bootstrap, the double bootstrap, the m-out-of-n bootstrap and the wild bootstrap. There is no code whatsoever to guide the reader, and the explanation given for these topics is totally inadequate.

When the bootstrap is inconsistent and how to remedy it

This chapter gives a set of scenarios when the bootstrap procedure can fail

  • For small sample sizes (less than 10), the bootstrapped sample is not reliable
  • Distributions that have infinite second moments
  • Estimating extreme values
  • Unstable AR processes
  • Long memory processes

 

Takeaway:

This is the worst book that I have read in recent times. The authors are trying to cash in on the popularity of R. The title of the book is completely misleading: it is neither an introduction to bootstrap methods nor an introduction to R functions for bootstrapping. All it does is give a cursory and inadequate treatment of the bootstrap technique. Do not buy or read this book; it is a total waste of time and money.
