June 2018
Monthly Archive
June 28, 2018
The Gradient Boosting algorithm is one of the most powerful algos out there for solving classification and regression problems. This book gives a gentle introduction to the various aspects of the algo without overwhelming the reader with its detailed math. The fact that there are many visuals in the book makes the learning very sticky. Well worth a read for anyone who wants to understand the intuition behind the algo.
Here is a link to a detailed summary:
Machine Learning With Boosting – Summary
June 21, 2018
Posted by safeisrisky under Technology
The following are some of the points mentioned in the book:
 Cloud is just a building full of computers. There are racks of computers specially built to fit in datacenters
 A cloud provider rents computers as a service. It is akin to a car rental agency
 Cloud computing is the new electricity and everyone’s fighting to be the new utility provider of choice
 Cloud Storage came first, with S3 (Simple Storage Service) launching in 2006
 Cloud Computing began with EC2 (Elastic Compute Cloud) in 2007
 Software developers have made S3 and EC2 successful
 Cloud costing – Shift from Capex to Opex
 Cloud Native – A cloud-centric approach to developing software
 Netflix – An example of Cloud Native service: Early adopter of EC2 and one of the biggest success stories of Cloud
 Computers failing in the cloud is the norm. It is expected. It is ok. This changes the way software is developed
 Concept of multiple availability zones in a region – Makes the cloud resilient to failure
 IaaS – Rent infra from a cloud provider
 PaaS – Rent a platform – Google App Engine, databases, Elasticsearch, Machine Learning services and a ton of other platforms that are available
 SaaS – Software as a service
 Both IaaS and PaaS are services used by developers to build products. No end user will ever use IaaS or PaaS directly. They are exclusively for developers. SaaS is meant for end users to use directly.
 A VM allows many programs to run on one computer at the same time. VMs are how big, powerful computers are shared.
 Analogy of VMs and a house owner who puts his room on Airbnb and somehow manages to magically accommodate all the demand – Virtualization explained well
 The real genius of the cloud: selling previously unused computer capacity
 Containers evolved after VMs and are even better at sharing computers because they use fewer resources. Since containers use fewer resources than VMs, it’s possible to pack even more applications onto a bare-metal computer.
 Containers are so efficient at letting programs share a computer that two to three times more applications can share a computer using containers compared to VMs.
 We think of cloud computing as renting computers in the cloud, but in reality what people are renting are slices of computers using VMs. Via containers, you can rent even smaller slices of a computer.
 Serverless takes renting smaller slices of a computer to the next level. With serverless, instead of running a complete application in a VM or a container, you upload a function into a serverless service and it takes care of everything else.
 Why is serverless computing the future?
 You don’t rent computers anymore. You don’t have to worry about VMs or containers. You no longer have to deal with capacity planning, deploying software, installing software, patching software, or elastically scaling in response to demand. You’re free from all that extra overhead. All you do is create and upload functions. You don’t even have to worry about failures like you do with EC2. (A minimal sketch of such a function appears at the end of this list.)
 Coke handles all their vending machine transactions using AWS Lambda
 Microsoft sells a private cloud product as do RedHat, VMware, and IBM.
 Netflix story of leveraging AWS to become what it is today
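Since several of these points revolve around uploading just a function, here is a minimal sketch of what such a function might look like, following AWS Lambda's Python handler convention; the event fields are made up for illustration:

```python
# A minimal sketch of a serverless function using AWS Lambda's Python
# handler convention. Lambda invokes handler(event, context); the
# platform worries about the servers, scaling and patching for you.
def handler(event, context):
    # hypothetical vending-machine style payload
    item = event.get("item", "unknown")
    amount = event.get("amount", 0)
    return {"statusCode": 200, "body": f"charged {amount} for {item}"}
```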
June 21, 2018
This book is a great quick read that highlights various aspects of Neural Network components. There are three parts to the book. The first part takes the reader through the various components of a Neural Network. The second part equips the reader to write a simple piece of python code from scratch (not using any off-the-shelf library) that sets up a neural network, trains it, and tests the structure on a certain set of test samples. The third part adds some fun exercises that test a specific neural network.
Deep Learning as an algo and framework has become extremely popular in the last few years. In order to understand Deep Learning, a basic familiarity with Neural Networks in their original form is useful. This book acts as a good entry point for any ML newbie who wants to learn Neural Networks. Even though there are a few equations in the first part of the book, the details of every component of each equation are explained in such a way that you just need to know high school math to understand the contents.
The author starts with a very simple example of estimating a conversion function that takes input in kilometers and gives the output in miles. Although you might know the exact conversion between the two scales, there is another interesting way to go about solving this problem. If you know the true value of the output in miles, you can guess the conversion function, and then check the extent of error between the function's estimate and the true value. Based on the magnitude and direction of the error, one can adjust the conversion function in such a way that the output of the function is as close to the true value as possible. The way you tweak the various parameters in the conversion function depends on the derivative of the conversion function with respect to those parameters.
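Here is that idea as a minimal python sketch; the numbers and the learning rate are mine, not the book's:

```python
# A minimal sketch of the km-to-miles example: guess a conversion
# constant, measure the error against a known true value, and nudge the
# constant in proportion to the error (and the input, which is what the
# derivative with respect to the constant works out to).
km = 100.0
true_miles = 62.137          # the known true output for 100 km
c = 0.5                      # initial guess for the conversion constant
learning_rate = 0.00005

for step in range(10):
    error = true_miles - c * km
    c += learning_rate * error * km      # error shrinks by half each step here
    print(f"step {step}: c = {c:.5f}, error = {error:.5f}")
```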
In Neural Network training, one needs to compute the partial derivative of the error with respect to a weight parameter and then use that value to tweak the weight parameter. Why should one do that? What has the partial derivative of the error got to do with weight parameter updates? One point that is explained particularly well in the book is the need to moderate, rather than fully apply, the updates to the weight parameters. If you let every training sample dictate a full update, the weight parameters will fit only the last training sample very well and the whole model might perform badly on all the other samples. By using the learning rate as a hyperparameter, one can control the amount of the weight updates. The rationale for something more than a linear classifier is illustrated via classifying the output of the XOR operator.
The analogy between water in a cup and the activation function is very interesting. Observations suggest that neurons don’t react readily, but instead suppress the input until it has grown so large that it triggers an output. The idea of connecting this threshold with an activation function is a nice way to understand activation functions. The fact that two functions are being used at every node, the first being the summation operator and the second being the activation function, is illustrated via nice visuals. Even a cursory understanding of Neural Networks requires at least a basic knowledge of matrices. The author does a great job of demonstrating the purpose of matrices, i.e. easy representation of the NN model as well as efficient computation of forward propagation and backpropagation.
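A tiny sketch of those two functions in matrix form; the weights and inputs below are illustrative, not from the book:

```python
import numpy as np

# The two functions at every node, written for a whole layer at once:
# a weighted summation (matrix multiply) followed by an activation
# (sigmoid here).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.9, 0.1, 0.8])          # outputs of the previous layer
W = np.array([[0.9, 0.3, 0.4],         # one row of weights per node
              [0.2, 0.8, 0.2],
              [0.1, 0.5, 0.6]])

net_input = W @ x                      # summation operator for every node
print(sigmoid(net_input))              # activation limits each output to (0, 1)
```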
The idea that the partial derivative is useful in tweaking the weight parameters is extremely powerful. One can visualize it by thinking of a person trying to reach the lowest point of a topsy-turvy surface. There are many ways to reach the lowest point on the surface. In the process of reaching it, the gradient of the surface in several directions is essential, as it guides the person in the right direction. The gradient gets smaller as you approach the minimum point. This basic calculus idea is illustrated via a set of rich visuals. Any person who has no understanding of partial derivatives will still be able to follow the idea by going through the visuals.
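The same idea in a few lines, on a one-dimensional "surface" f(x) = x**2:

```python
# Gradient descent on f(x) = x**2: step opposite the gradient (2x), and
# notice that the steps shrink on their own as the slope flattens near
# the minimum at x = 0.
x = 3.0
learning_rate = 0.1
for step in range(25):
    x -= learning_rate * (2 * x)
print(x)   # very close to 0
```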
The second part of the book takes the reader through a Python class that can be used to train and test a Neural Network structure. Each line of the code is explained in such a way that a Python newbie will have no problem understanding the various components of the code. A rough skeleton of such a class follows.
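The skeleton below shows the kind of three-method class such a part builds; the method names and internals are my reconstruction, not the book's exact code:

```python
import numpy as np

# A rough skeleton (my reconstruction): initialise weights, train on one
# sample via backpropagation, and query the trained network.
class NeuralNetwork:
    def __init__(self, n_input, n_hidden, n_output, learning_rate):
        # small random weights for input->hidden and hidden->output links
        self.w_ih = np.random.normal(0.0, 0.5, (n_hidden, n_input))
        self.w_ho = np.random.normal(0.0, 0.5, (n_output, n_hidden))
        self.lr = learning_rate
        self.activation = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid

    def query(self, inputs):
        hidden = self.activation(self.w_ih @ inputs)
        return self.activation(self.w_ho @ hidden)

    def train(self, inputs, targets):
        hidden = self.activation(self.w_ih @ inputs)
        outputs = self.activation(self.w_ho @ hidden)
        output_errors = targets - outputs
        hidden_errors = self.w_ho.T @ output_errors
        # gradient-based updates for both weight matrices
        self.w_ho += self.lr * np.outer(output_errors * outputs * (1 - outputs), hidden)
        self.w_ih += self.lr * np.outer(hidden_errors * hidden * (1 - hidden), inputs)

# usage sketch:
# nn = NeuralNetwork(n_input=3, n_hidden=3, n_output=3, learning_rate=0.3)
# nn.train(np.array([0.9, 0.1, 0.8]), np.array([0.0, 1.0, 0.0]))
# print(nn.query(np.array([0.9, 0.1, 0.8])))
```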
The third part of the book invites the readers to test neural network digit classification on a sample generated from their own handwriting. For fun, you can write down a few digits on a piece of paper and then check whether a NN trained on a given number of sample points can successfully recognize the digits in your handwriting.
The book shows a lot of visuals depicting NN performance as hyperparameters such as the learning rate, the number of hidden layers, etc. are tweaked. All of those visuals, I guess, will make any curious reader want to explore this model further. In fact, the NN model in its new avatar, Deep Learning, is becoming a prerequisite skill for any Machine Learning engineer.
June 19, 2018
The author says that there are five things about Neural Networks that any ML enthusiast should know:

Neural Networks are specific : They are always built to solve a specific problem

Neural Networks have three basic parts, i.e. Input Layer, Hidden Layer and Output Layer

Neural Networks are built in two ways

Feed Forward : In this type of network, signals travel only one way, from input to output. These types of networks are straightforward and used extensively in pattern recognition

Recurrent Neural Networks: With RNNs, the signals can travel in both directions and there can be loops. Even though these are powerful, they have been less influential than feed forward networks

Neural Networks are either Fixed or Adaptive : The weight values in a neural network can be fixed or adaptive

Neural Networks use three types of datasets. The training dataset is used to adjust the weights of the neural network. The validation dataset is used to minimize overfitting. The testing dataset is used to gauge how accurately the network has been trained. Typically the split ratio among the three datasets is 6:2:2 (a quick sketch of such a split follows this list)
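Here is a quick sketch of a 6:2:2 split using scikit-learn on toy data; the library choice is mine, not the book's:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))       # toy features
y = rng.integers(0, 2, size=100)    # toy labels

# 60% train, then split the remaining 40% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```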
There are five stages in a Neural Network and the author creates a good set of visuals to illustrate each of the five stages:

Forward Propagation

Calculate Total Error

Calculate the Gradients

Gradient Checking

Updating Weights
Be it a convolutional neural network (CNN) or a recurrent neural network (RNN), all these networks have a structure or shell that is made up of similar parts. These parts are called hyperparameters and include elements such as the number of layers, the number of nodes and the learning rate. Hyperparameters are knobs that can be tweaked to help a network successfully train. The network does not tweak these hyperparameters itself.
There are two types of hyperparameters in any Neural Network, i.e. required hyperparameters and optional hyperparameters. The following are the required hyperparameters:

Total number of input nodes. An input node contains the input of the network and this input is always numerical. If the input is not numerical, it is always converted. An input node is located within the input layer, which is the first layer of a neural network. Each input node represents a single dimension and is often called a feature.

Total number of hidden layers. A hidden layer is a layer of nodes between the input and output layers. There can be either a single hidden layer or multiple hidden layers in a network. Multiple hidden layers mean it is a deep learning network.

Total number of hidden nodes in each hidden layer. A hidden node is a node within the hidden layer.

Total number of output nodes. There can be a single or multiple output nodes in a network

Weight values. A weight is a variable that sits on an edge between nodes. The output of every node is multiplied by a weight, and summed with other weighted nodes in that layer to become the net input of a node in the following layer

Bias values. A bias node is an extra node added to each hidden and output layer, and it connects to every node within each respective layer. The bias is a way to shift the activation function to the left or right.

Learning Rate. It is a value that speeds up or slows down how quickly an algorithm learns. Technically, it is the size of the step an algo takes when moving towards the global minimum.
The following are the optional hyperparameters:

Learning rate decay

Momentum. This is the value that is used to help push a network out of a local minimum

Minibatch size

Weight decay

Dropout: Dropout is a form of regularization that helps a network generalize its fit and increase accuracy. It is often used with deep neural networks to combat overfitting, which it accomplishes by occasionally switching off one or more nodes in the network. (A sketch showing where all these knobs sit in code follows this list.)
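As a sketch of where these knobs appear in practice, here is a tiny model in the Keras API; the use of Keras and all the sizes are my assumptions, purely for illustration:

```python
# A hedged sketch (not from the book): layer counts and node counts are
# the required hyperparameters; dropout, momentum and learning rate are
# set explicitly as the optional knobs.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(64,)),  # 64 input nodes -> 32 hidden nodes
    layers.Dropout(0.2),                                     # optional: dropout regularization
    layers.Dense(2, activation="softmax"),                   # 2 output nodes
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # learning rate, momentum
    loss="categorical_crossentropy",
)
```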
Forward Propagation
In this stage the input moves through the network to become output. To understand this stage, there are a couple of aspects one needs to understand: how the inputs arriving at a node are combined, and how the node decides what to output.
Well, specific mathematical functions are used to accomplish both of the above tasks. There are two types of mathematical functions used in every node: the first is the summation operator and the second is the activation function. Every node, irrespective of whether it is a node in the hidden layer or an output node, has several inputs. These inputs have to be summed up in some way to compute the net input. The net input is then fed into an activation function that decides the output of the node. There are many types of activation functions: Linear, Step, Hyperbolic Tangent, Rectified Linear Unit (which has become popular since 2015). The reason for using activation functions is to limit the output of a node. If you use a sigmoid function, the output is restricted to between 0 and 1. If you use a tanh function, the output is restricted to between -1 and 1. The basic reason for using activation functions is to introduce non-linearity (most real-life classification problems do not have nice linear boundaries).
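A quick illustration of the output ranges just mentioned:

```python
import numpy as np

# Three common activation functions and their output ranges: sigmoid
# squashes to (0, 1), tanh to (-1, 1), ReLU zeroes out negative inputs.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-4, 4, 9)
print(sigmoid(z))   # all values in (0, 1)
print(np.tanh(z))   # all values in (-1, 1)
print(relu(z))      # zero for negative inputs
```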
Once the basic mathematical functions are set up, it becomes obvious that any further computations would require you to organize everything in vectors and matrices: an input vector, a weight matrix between every pair of adjacent layers, and the net-input and output vectors of each layer.
Calculating the Total Error
Once forward propagation is done and the input has been transformed into values at the output nodes, the next step in NN modeling is the computation of the total error. There are several ways in which one can compute this error: Mean Squared Error, Squared Error, Root Mean Square, Sum of Squared Errors
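For instance, the mean squared error over toy outputs:

```python
import numpy as np

# Total error as mean squared error between the network's outputs and
# the target values (toy numbers).
outputs = np.array([0.72, 0.13])
targets = np.array([1.0, 0.0])
print(np.mean((targets - outputs) ** 2))   # ~0.0477
```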
Calculation of Gradients
Why should one compute gradients, i.e. the partial derivative of the error with respect to each of the weight parameters? Well, the classic way of minimizing any function involves computing the gradient of the function with respect to its variables. In any standard multivariate calculus course, this idea is drilled into the student's mind: if there is a function that depends on multiple parameters and one has to choose a set of parameters that minimizes the function, then the gradient (and, for second-order methods, the Hessian) is your friend.
Backpropagation
The key idea of backpropagation is that the error at the output is propagated backwards through the network, so that the partial derivative of the error with respect to each individual weight parameter can be computed and used to tweak that weight's value.
Before updating the parameter values based on partial derivatives, there is an optional step of checking whether the analytical gradient calculations are approximately accurate. This is done by a simple perturbation of a weight parameter, computing the finite-difference value, and comparing it with the analytical partial derivative.
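A minimal sketch of that check, using a one-weight "network" with error E(w) = (target - w*x)**2:

```python
# Gradient checking: perturb the weight by a small epsilon, recompute
# the error both ways, and compare the finite-difference estimate with
# the analytical partial derivative.
x, target = 1.5, 2.0
w = 0.8
eps = 1e-5

def error(w):
    return (target - w * x) ** 2

numerical = (error(w + eps) - error(w - eps)) / (2 * eps)
analytical = -2 * x * (target - w * x)       # dE/dw by hand
print(abs(numerical - analytical))           # should be tiny if the math is right
```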
There are various ways to update the parameters. Gradient Descent is an optimization method that helps us find the combination of weights for a network that will minimize the output error. The idea is that there is an error function and you need to find its minimum by computing the gradients along the path. There are three types of gradient descent methods mentioned in the book – the Batch Gradient Descent method, the Stochastic Gradient Descent method and the Mini-Batch Gradient Descent method. The choice among them depends on how much data you use before each update of the weight parameters.
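A sketch of the three schedules in one loop, on a toy linear model: set batch_size = N for batch gradient descent, 1 for stochastic, and anything in between for mini-batch.

```python
import numpy as np

# The same training loop covers all three methods; only batch_size
# changes how much data is seen before each weight update.
rng = np.random.default_rng(0)
N, batch_size = 100, 16
X = rng.random((N, 3))
y = rng.random(N)
w = np.zeros(3)
lr = 0.1

for epoch in range(5):
    order = rng.permutation(N)
    for start in range(0, N, batch_size):
        rows = order[start:start + batch_size]
        grad = -2 * X[rows].T @ (y[rows] - X[rows] @ w) / len(rows)
        w -= lr * grad
print(w)
```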
Constructing a Neural Network – Hands-on Example
The author takes a simple example of image classification – an 8 pixel by 8 pixel image to be classified as a human/chicken. For any Neural Network model, the determination of the network structure has five steps:

Based on the above structure, we have 4224 weight parameters and 66 bias parameters, 4290 parameters in total. Just pause and digest the scale of the parameter estimation problem here. (A quick arithmetic check appears after this list.)

Understanding the Input Layer

Understanding the Output Layer

The author simplifies even further so that he can walk the reader through the entire process.

Once a random set of numbers is generated for each of the weight parameters, the output node values can be computed for each training sample. Based on the output nodes, one can compute the error

Once the error has been calculated, the next step is back propagation so that the weight parameters that have been initially assigned can be updated.

The key ingredient of Back propagation is the computation of gradients, i.e. partial derivatives of the error with respect to various weight parameters.

Gradients for Output Layer weights are computed

Gradients for Output Layer Bias weights are computed

Gradients for Hidden Layer weights are computed

Once the gradients are computed, an optional step is to check whether the numerical estimates of the gradients and the analytical values of the gradients are close enough

Once all the partial derivatives across all the weight parameters are computed, the weight parameters can be updated to new values via one of the gradient adjustment methods and, of course, the learning rate (a hyperparameter)
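As promised above, a quick arithmetic check of the parameter count. One layer configuration consistent with the quoted totals is 64 input nodes (8 x 8 pixels), 64 hidden nodes and 2 output nodes; that structure is my inference, not necessarily the book's exact one:

```python
# Parameter-count check for an assumed 64-64-2 structure.
n_in, n_hidden, n_out = 64, 64, 2
weights = n_in * n_hidden + n_hidden * n_out   # 4096 + 128 = 4224
biases = n_hidden + n_out                      # 64 + 2 = 66
print(weights, biases, weights + biases)       # 4224 66 4290
```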
Building Neural Networks
There are many ML libraries out there such as TensorFlow, Theano, Caffe, Torch, Keras and scikit-learn. You have got to choose what works for you and go with it. TensorFlow is an open source library developed by Google that excels at numerical computation. It can be run on all kinds of computers, including smartphones, and is quickly becoming a popular tool within machine learning. TensorFlow supports deep learning (neural nets with multiple hidden layers) as well as reinforcement learning.
TensorFlow is built on three key components:

Computational Graph: This defines all of the mathematical computations that will happen. It does not perform the computations and it doesn’t hold any values. It contains nodes and edges

Nodes: These represent mathematical operations. Many of the operations are complex and happen over and over again

Edges: These represent tensors, which hold the data that is sent between the nodes.
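A tiny sketch of these three components in the TensorFlow 1.x graph style of the book's era (run here via the compat module of TensorFlow 2):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Nodes are operations, edges carry tensors, and the graph only defines
# the computation -- nothing runs until a session executes it.
a = tf.constant(3.0)        # a node producing a constant tensor
b = tf.constant(4.0)
total = tf.add(a, b)        # a node representing an operation

with tf.Session() as sess:
    print(sess.run(total))  # 7.0 -- computation happens only here
```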
There are a few chapters towards the end of the book that go into explaining the usage of TensorFlow. Frankly, this is overkill for a book that aims to be an introduction. All the chapters that go into TensorFlow coding details could have been removed, as they serve no purpose: one neither gets a decent overview of the library nor a real treatment of its details.
The book is a quick read and the visuals will be sticky in one’s learning process. This book will equip you with just enough knowledge to speak about Neural Networks and Deep Learning. Real understanding of NNs and Deep NNs will anyway come only from slogging through the math and trying to solve some real-life problems.
June 18, 2018
The entire book is organized as 20 small bite-sized chapters. Each chapter focuses on one specific thing and explains everything via visuals (as is obvious from the title).
The author starts off by explaining the basic idea of Random Forests, i.e. a collection of decision trees that have been generated via randomization. The randomness comes from the fact that a random subset of the training dataset is used for each tree and a random set of attributes is used for splitting the data. Before understanding the random forest, one needs to understand how a single decision tree works.
To make the understanding of random forests concrete, a simple example of classifying a set of fruits based on length, width and color is used. A decision tree for classifying fruits is drawn based on these three simple attributes, using a manual process. If the data is sufficiently scattered across the various dimensions, one can use basic intuition to construct a decision tree. The result of manually drawing a decision tree is shown over various iterations. There are a few immediate takeaways from this tree: there are regions where training data from several classes fall in the same region, which makes it difficult to say anything conclusive about those regions except in a probabilistic sense. Also, the ideal boundaries in the decision space need not be horizontal or vertical, whereas a decision tree only uses horizontal or vertical lines. To generate a decision tree via python, all it takes is a few lines of code. Thanks to the massive effort behind the sklearn library, there is a standardized template that one can use to build many models.
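A hedged sketch of those few lines; the numbers are made up, not the book's dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Classify fruits by length, width and a colour code (toy data).
X = np.array([[7.0, 3.0, 0],    # [length, width, colour] per fruit
              [7.5, 3.2, 0],
              [3.0, 3.0, 1],
              [2.8, 2.9, 1]])
y = np.array(["banana", "banana", "apple", "apple"])

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[7.2, 3.1, 0]]))   # -> ['banana']
```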
One of the disadvantages of decision trees is overfitting. This limitation can be mitigated by the use of Random Forests. The good thing about this book is that all the data used for illustration is available on the web; one can replicate each and every figure in the book via a few lines of python code. Generating a Random Forest via sklearn is also just a few lines of code. Once you have fit a Random Forest, the first thing you might want to do is predict. There are two ways to predict the outcome of the Random Forest: one is picking the class that is predicted most often among the trees, and the second is obtaining class probabilities. One of the biggest disadvantages of decision trees and Random Forest models is that they cannot be used to extrapolate.
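And the random forest version, showing both prediction modes (same made-up fruit data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# predict() returns the majority-vote class across the trees;
# predict_proba() returns the class probabilities.
X = np.array([[7.0, 3.0, 0], [7.5, 3.2, 0], [3.0, 3.0, 1], [2.8, 2.9, 1]])
y = np.array(["banana", "banana", "apple", "apple"])

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(rf.predict([[7.2, 3.1, 0]]))        # most-voted class
print(rf.predict_proba([[7.2, 3.1, 0]]))  # per-class probabilities
```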
The randomness in a random forest also comes from the fact that the set of attributes considered for the splitting criterion is random. The general rule is to pick a random set of attributes and then build a decision tree based on these randomly selected attributes. The basic idea of selecting the attributes randomly is that each decision tree is built on a different randomized set of attributes and hence there is less scope for overfitting.
One of the ways to check the validity of a model is to check its performance on a test set. Usually one employs cross-validation. In the case of random forests, a better way is to use the Out-of-Bag estimate. The basic idea of the O.O.B. error estimate is this: if there are N data samples and a random decision tree is drawn, on average only about 2/3rd of the samples are used to train it. Thus roughly 1/3rd of the dataset is available as test data for any tree in the random forest. Hence one can compute the accuracy of each tree on its unused samples and average the error across all the trees, giving rise to the Out-of-Bag error estimate.
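In scikit-learn this estimate is one keyword away (toy data, not the book's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Setting oob_score=True makes each tree's held-out (unsampled) rows act
# as its test set; the aggregated accuracy is exposed as oob_score_.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # out-of-bag accuracy estimate
```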
The author also mentions two criteria by which a decision tree node can be split, the Gini criterion and the Entropy criterion; both are types of impurity measures. One of the interesting features of the Random Forest algo is that it gives you relative importance scores for the various attributes. Even though these are relative measures and there are some caveats in interpreting them, overall they might be useful in certain situations.
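Both the split criterion and the importance scores are directly exposed in scikit-learn (again on toy data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(criterion="entropy", random_state=0).fit(X, y)  # or criterion="gini"
print(rf.feature_importances_)   # relative importance score per attribute
```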
The author ends the book with a broad set of guidelines for using Random Forests:
 Use cross-validation or out-of-bag error to judge how good the fit is
 Better data features are the most powerful tool you have to improve your results. They trump everything else
 Use feature importances to determine where to spend your time
 Increase the number of trees until the benefit levels off
 Set the random seed before generating your Random Forest so that you get reproducible results each time
 Use predict() or predict_proba() where it makes sense, depending on whether you want just the most likely answer or want to know the probability of all the answers
 Investigate limiting branching by:
 Number of splits
 Number of data points in a branch to split
 Number of data points in the final resulting leaves
 Investigate looking at different numbers of features for each split to find the optimum. The default number of features in many implementations of Random Forest is the square root of the number of features. However, other options, such as using a percentage of the number of features or the base-2 logarithm of the number of features, can be good choices (see the sketch after this list)
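For example, scikit-learn's implementation lets you express the per-split feature count in all the ways just listed:

```python
from sklearn.ensemble import RandomForestClassifier

# Ways of setting the number of features examined per split (values illustrative):
rf_sqrt = RandomForestClassifier(max_features="sqrt")  # square root -- the common default
rf_log2 = RandomForestClassifier(max_features="log2")  # base-2 logarithm
rf_frac = RandomForestClassifier(max_features=0.5)     # a fraction (percentage) of the features
```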
June 17, 2018
This book provides a non-mathy entry point into the world of decision trees and random forests.
Chapter 1 introduces the three kinds of learning algorithms:

Supervised Learning Algorithm: The algo feeds on labeled data, and the fitted algorithm is then used on unlabeled data

Unsupervised Learning Algorithm: The algo feeds on unlabeled data to figure out structure and patterns on its own

Semi-Supervised Learning: The algo feeds on labeled and unlabeled data. The algo needs to figure out the structure and patterns of the data; however, it gets some help from the labeled data.
The algorithms covered in the book fall under the supervised learning category. Before understanding decision trees, one needs to be familiar with the basic terminology. Chapter 2 of the book goes into the meaning of various terms such as:

Decision Node

Chance Node

End Node

Root Node

Child Node

Splitting

Pruning

Branch
Chapter 3 goes into creating a decision tree for a trivial example using pen and paper. In the process of doing so, it touches upon various aspects of the process, i.e. splitting, determining the purity of a node, determining the class label of a node, etc. All of these are subtly introduced using plain English.
Chapter 4 talks about the three main strengths and weaknesses of hand-drawn decision trees.
Three main weaknesses:

Decision trees can change very quickly

Decision trees can become very complex

Decision trees can cause "paralysis by analysis"
Three main strengths:

Decision trees force you to consider outcomes and options

Decision trees help you visualize a problem

Decision trees help you prioritize
Decision Trees
Chapter 5 of the book gives a list of popular decision tree algos. A decision tree can be used to perform two tasks: classification and regression; the former involves classifying cases based on certain features, whereas the latter involves predicting a continuous value of a target variable based on input features. The five most common decision tree algos are:

ID3

C4.5

C5.0

CHAID

CART
Chapter 6 of the book goes into showcasing a simple dataset that contains movies watched by X, along with certain attributes. The objective of the algo is to predict whether X will like a movie not present in the training sample, based on those attributes. The first step in creating a decision tree involves selecting the attribute on which the root node should be split. The concept of "impurity" of a node is illustrated via a nice set of visuals.
Chapter 7 goes into the math behind splitting a node, i.e. the principles of entropy and information gain. Once a node is split, one needs a metric to measure the purity of the node; this is done via entropy. For each split on an attribute, one can compute the entropy of each subset of the node. To aggregate the purity measures of the subsets, one needs the concept of information gain. In the context of node splitting, the information gain is computed as the difference between the entropy of the parent and the weighted average entropy of the children. Again, a set of rich visuals is used to explain every component of the entropy formula and information gain (Kullback-Leibler divergence). (A small worked example follows.)
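Here is a small worked example of that arithmetic; the class counts are made up:

```python
import numpy as np

# A parent node with 10 examples (6 of class A, 4 of class B) split into
# two children. Gain = parent entropy - weighted average child entropy.
def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = entropy([6, 4])                      # ~0.971 bits
children = [([4, 1], 5), ([2, 3], 5)]         # (class counts, size) per child
weighted = sum(n / 10 * entropy(c) for c, n in children)
print(parent - weighted)                      # information gain, ~0.125 bits
```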
Chapter 8 addresses common questions around Decision trees

How/When does the algo stop splitting?

Are there other methods to measure impurity?

What is a greedy algo?

What if the dataset has two identical examples?

What if there are more than 2 classes?
Chapter 9 talks about the potential problems with Decision trees and the ways to address them

Overfitting

Information gain bias
Chapter 10 gives an overview of Decision Tree Algorithms. Algorithms differ in the way the following aspects are handled:

How does the algorithm determine what to split?

How does the algorithm measure purity?

How does the algorithm know when to stop splitting?

How does the algorithm prune?
Here is a list of popular Decision tree algos with their pros and cons:

The ID3 algorithm (Iterative Dichotomiser 3) is the "grandfather" of decision tree algorithms and was developed in 1986 by Ross Quinlan, a machine learning researcher.

Pros:

Cons:

Susceptible to overfitting

Does not naturally handle numerical data.

Is not able to work with missing values.

The C4.5 algorithm is the successor to the ID3 algorithm and was invented in 1993 by Ross Quinlan. It makes use of many of the same elements as ID3 but also has a number of improvements and benefits.

Pros:

Can work with either a continuous or a discrete dataset. This means it can be used for classification or regression and can work with categorical or numerical data

Can work with incomplete data.

Solves "overfitting" by pruning and its use of the Gain Ratio.

Cons:

Constructs empty branches with zero values.

Tends to construct very large trees with many subsets.

Susceptible to overfitting.

CART was first developed in 1984, and a unique characteristic of it is that it can only construct binary trees. ID3, C4.5 and CHAID are all able to construct multi-way splits.

Pros:

Cons:

It can only split on a single variable.

Susceptible to instability.

Splitting method is biased towards nodes with more distinct values.

Overall, the algorithm can be biased towards nodes with more missing values.
Chapter 11 gives sample python code to build a decision tree via CART.
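Along the same lines (though not the book's exact code), scikit-learn's tree learner is CART-style and builds binary splits:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small CART-style tree and print its binary splits as text.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))
```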
Random Forests
A random forest is a machine learning algorithm that makes use of multiple decision trees to predict a result; this collection of trees is often called an ensemble. The basic idea of a random forest is that each decision tree is built on a random set of observations from the available dataset.
Chapter 12 gives the pros and cons of the random forest algo:

Pros:

More accurate than a single decision tree

More stable than a single decision tree

Less susceptible to the negative impact of greedy search

Cons:

(Figure from the book: pros and cons of a decision tree)
Chapter 13 describes the basic algo behind the random forest in three steps. The first step involves selecting a bootstrapped subset of the data. This is followed by selecting a random set of attributes from the bootstrapped sample. Based on the selected attributes, the best split is made, and this is repeated until a stopping criterion is reached. (A small sketch of the randomness involved follows.)
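A runnable sketch of the two sources of randomness in the first two steps (all sizes illustrative):

```python
import numpy as np

# Step 1: bootstrap-sample the rows; step 2: draw a random subset of
# attributes to consider for a split.
rng = np.random.default_rng(0)
n_samples, n_features = 150, 10

boot_rows = rng.choice(n_samples, size=n_samples, replace=True)    # step 1
split_features = rng.choice(n_features, size=3, replace=False)     # step 2
print(len(set(boot_rows)), split_features)
# step 3 would pick the best split among split_features and recurse
# until a stopping criterion is reached.
```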
Chapter 14 describes the way in which a random forest predicts the response for test data. There are two methods described in this chapter, i.e. predicting with the majority vote and predicting with the mean.
Chapter 15 explains the way to test a random forest for its accuracy. The method entails computing the O.O.B. (Out-of-Bag) error estimate. The key idea is to create a map between each data point and all the trees in which that data point does not act as a training sample. Once the map is created, for every randomized decision tree you can find a set of data points that were not used to train it, and these can be used to test that decision tree.
Chapter 16 goes into the details of computing attribute importance. The output of this computation is a set of relative scores for all the attributes in the dataset. These scores can be used to preprocess the data – remove all the unwanted attributes and rerun the random forest.
Chapter 17 answers some of the common questions around random forests
Chapter 18 gives sample python code to build a random forest via the scikit-learn library
Here are some of the visuals from the book:
I think the visuals are the key takeaway from the book. You can read about the concepts mentioned in the book in a ton of places. However, you might not find adequate visuals in a book that explains the math. This book is a quick read and might be worth your time, as visuals serve as a powerful aid for learning and remembering concepts.