Part I – Fundamentals
The first part of the book creates the necessary background for the reader to understand the various algorithms discussed in the book. The book implements all the algorithms in Java. Its not a problem if you are rusty with Java. The authors give some basic crash course on the language, just enough to understand all the algorithms. I have forgotten Java long back. I think it has been a decade since I have coded any Java program. However the code and the commentary for the code is crystal clear thus making it is easy for any nonjava reader to follow the code and implement in a language of their choice. There are two essential aspects that are discussed in the first part of the book. First, it explains the main data structures like Bags, Queues, Linked lists, Doubly linked listed that are essential building blocks for all the algorithms in the book. Secondly, it explains the concepts behind analyzing the performance of an algorithm. When I look at these concepts, I think these need to understood just once and they are yours for lifetime. Basic principles of performance analysis are described in detail and I guess they ensure that every time you come across an algorithm, the first question that typically will pops up in your mind is, "what’s the order of running time ?” . Answering this question allows one to connect with various algorithms that have similar running time. For example FFT algorithm is of O(N log N). Well, FFT achieves N log N by the same divide and conquer method that a merge sort algorithm uses.
Part II – Sorting
All sorting algorithms can be categorized in to two types. The first type of algos are “inplace” sorting algos and second type are the ones that need extra space. The section starts off by discussing elementary sort algos such as selection sort, insertion sort and shell sort. The running time for selection sort and insertion sort is quadratic whereas shell sort is O(N^{3/2}). Once these elementary algos are dealt, the book moves on to discussing merge sort and quick sort. Both have linearithmic running times. There are two versions of Merge sort discussed in the book – the top down merge sort as well as bottom up merge sort, the usage depending on the context and need. Merge sort is a stable sorting algorithm and but it is not optimal with respect to space usage. The book then talks about Priority queues that are fundamentally different data structures. In all the previous sorting algos, the storage is typically an array and there is no specific structure in the array. In Priority queues, the keys are stored in an array but with a certain structure for the array. Even though the data is stored in an array, the structure is a tree structure.
The chapter on priority queues starts off with a practical problem of computing a summary statistic for a data stream.There are applications where keys are processed not necessarily in an ordered format and not necessarily all at once. Basically you process a few data items , then process the items with the largest key or smallest key and then insert new data items. Think about a stream of data where there is a need for random inserts and random delete maximum or delete minimums operations from a random stream of data. A concrete example is one where there are large number of strings that represent transactions and the problem is to keep the top M transactions by amount. Sorting is ruled out as N is very large. Store M largest seen so far is one approach but this fails if M is very large. A client that uses elementary implementation faces O(NM) in time , O(M) in space. Unordered Array based implementation is poor as insertion is O(1) but deletion is O(N). Ordered Array based implementation is poor as insertion is O(N) and deletion is O(1). So, what’s the way out of this linear time insertion problem ? The solution lies in using a different data structure, the heap. A client that uses heapbased implementation faces O(N log M) in time and O(M) in space.
The book discusses two variants of Priority Queues, i.e. MaxPQ and MinPQ, the former always creates a parent node whose key is greater than that of children’s keys and the latter creates a parent node whose key is always less than the children’s keys. Using two helper operations called sink and swim operations, priority queues become very efficient data structures that operate on a linearithmic running time. Extending the concept of priority queue gives another sorting algorithm called the heap sort. The classical heapsort algorithm was invented by J. W. J. Williams and refined by R. W. Floyd in 1964. Heap sort has two phases. First phase is the heap construction and second phase is the sort down procedure. The sort down procedure is where most of the work gets done. The procedure is similar to selection sort but is very efficient because of heap structure. The true advantage of heapsort is that it does not require additional space. It does inplace sort and it is linearithmic. The downside to heap sort is its poor cache performance and hence it is rarely used in modern applications. The book has a lot of interesting programming problems. It all depends on how much time and effort you can invest in to the book. One of the problems I liked is the use of Priority Queue to compute Kendal’s tau.
Part – III: Symbol Tables
The third part of the book covers Symbol table, the building block for any search algorithm. A symbol table is a term used to describe an abstract mechanism for saving information(value) and retrieving the information by specifying a key.Symbol tables are sometimes called dictionaries or indices. Specifying a key and value can be done in umpteen number of ways. But it is important that one does it in an efficient manner. One easy way is to store two arrays one storing the key and the other storing the value. This is inefficient if the data involves random “puts” and “gets”. Any “get” becomes is linear time and hence becomes an inefficient algo. The two important operations for a symbol table are insert() and search(). Several symbol table comparisons take advantage of the order among the keys. Once there is an order amongst keys, the basic API involving put and get can be expanded to many other operations like min, max, floor, ceiling, rank, select, deleteMin, deleteMax, range etc.
The clients for Symbol Tables typically share the following characteristics

Search and insert operations are intermixed

Number of distinct keys is not small

Substantially more searches than inserts are likely

Search and insert patterns, through unpredictable are not random
The first symbol table implementation in the book is via unordered linked list. It is obvious that insert is O(1) whereas search is O(N). Subsequently the chapter implements ST using ordered linked list. In this case, using a rank procedure the search cost becomes logarithmic but the insertion cost remain linear time operation. After these elementary implementations, it is abundantly clear that these don’t’ scale. A fundamentally different data structure is needed to handle big data. Binary search trees are introduced in the next section of the book to address slow insertion issue for the elementary implementations
The best thing about Binary search trees is that insert and search operations are logarithmic in time. In fact it is is 1.39 log N. Delete operation is the complicated operation amongst all. In BST all operations take time proportional to the height of the tree. So, except for the worst case scenario, BST is much better than elementary implementations. In order to guarantee logarithmic time search and insert operations for the worst case scenario too, the book introduces a new data structure, a tweak of binary search trees that allows 2 keys to be at a node. These are called 23 trees and all the worst case situations are efficiently handled by 23 trees. However implementing 23 trees is done by RedBlack trees. Every Red Black tree has a one to one relationship with a 23 tree. The reason for it being called a Red Black tree : the node carries a color attribute that tells whether they belong to the same node in the corresponding 23 tree. The implementation details of a Reb Black tree is complicated, especially the delete operation. However if you sweat it out and implement it, you get a data structure that guarantees logarithmic performance in the average case as well as the worst case. These data structures are like high powered tools in your tool box to attack big problems. One of the interesting problems mentioned in the book is the reverse index for a big data set. If you have to code from scratch and not use some off the shelf functions, the best way to build an inverted index is to use Balanced Search trees. The last part of this section talks about a data structure that uses hashing to maintain key value pairs. Two types of data structures are discussed, first is the Separate Chaining Hash Symbol table and second is the Linear Probe Hash Symbol table. In the former you maintain separate symbol table takes for every hash key. In the latter , you maintain a single symbol table and manage all the collisions.
The following table summarizes all the algorithms discussed under symbol tables. As one can see, the more sophisticated data structure that one uses, the better handle one gets on the running times.
The first three parts of book takes up 500 pages of this massive 1000 page book. The other parts of the book cover algorithms based on graphs and strings. Hopefully will summarize the subsequent 500 pages of this book someday. The highlight of this book is “rich visuals” that explain the nuts and bolts of algorithms. I guess the natural progression for the reader would be to overlay a OOPS class hierarchy over the several individual classes mentioned in the book.
“Elements of Statistical Learning” (ESL) is often referred to as the “bible” for any person interested in building statistical models. The content in ESL is dense and the implicit prerequisites are good background in linear algebra, calculus and some exposure to statistical inference and prediction. Also, the visuals in ESL come with no elaborate code. If you don’t want to take them at face value and would like to check the statements / visuals in the book, you have got to sweat it out.
This book has four coauthors and two of them are coauthors of ESL, Trevor Hastie and Robert Tibshirani . Clearly the intent of the book is to serve as a primer to ESL. The authors have selected a few models from each of the ESL topics and have presented in such a way that anybody with an inclination to model can understand the contents of the book. The book is an amazing accomplishment by the authors as the book does not require any prerequisites such as linear algebra, calculus etc. Even the knowledge of R is not all that a prerequisite. However if you are not rusty with matrix algebra, calculus, probability theory , R etc., you can breeze through the book and get a fantastic overview of statistical learning.
I happened to go over this book after I had sweated my way through ESL. So, reading this book was like reiterating the main points of ESL. The models covered in ISL are some of the important ones from ESL. The biggest plus of this book is the generous visuals sprinkled throughout the book. There are 145 figures in a 440 page book, i.e. an average of one visual for every three pages. That in itself makes this book a very valuable resource. I will try to summarize some of the main points from ISL.
Introduction
Statistical Learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. The difference between the two is that in the former, there is a response variable that guides the modeling effort whereas in the latter there is none. The chapter begins with a visual exploration of three datasets; the first dataset is wage data set that is used to explore the relationship between wage and various factors, the second data set is the stock market returns data set that is used to explore the relationship between the up and down days of returns with that of various return lags, and the third dataset is a gene expression dataset that serves an example of unsupervised learning. They try to summarize the gamut of statistical modeling techniques available for a modeler.There are 15 datasets used throughout the book. Most of them are present in ISL package on CRAN.
Some historical background for Statistical Learning

19th century – Legendre and Gauss method of least squares

1936 – Linear Discriminant Analysis

1940 – Logistic Regression

1970 – Generalized Linear Model – Nelder and Wedderburn

1980s – By 1980s computing technology improved Breiman, Friedman, Olshen and Stone introduced classification and regression trees. Also principles of crossvalidation were laid out.

1986 – Hastie and Tibshirani coined the term generalized additive models in 1986 for a class of nonlinear extensions to generalized linear models, and also provided a practical software implementation.

In the recent years, the growth of R has made the availability of these techniques to a wide range of audience.
Statistical Learning
While building a model, it is helpful to keep in mind the purpose of the model. Is it for inferential purpose? Is it for prediction? Is it for both? If the purpose is to predict, then the exact functional form is not of interest as long as the prediction does a good job. If the model building is targeted towards inference, then exact functional form is of interest as it says a couple of things :

What variables amongst the vast array of variables are important?

What is the magnitude of this dependence?

Is the relationship a simple linear relationship or a complicated one?
In the case of a model for inference, a simple linear model would be great as things are interpretable. In the case of prediction, non linear methods are good too as long as they give a good sense of predictability. Suppose that there is a quantitative response and there are p different predictors, . One can test out a model that build a relationship between and . The model would then be
where represents the systematic information that X provides about y.
The model estimate is represented by . There are two types of errors, one called the reducible error and the other called the irreducible error.
The above can further be broken down
So, whatever method you choose for modeling, there will be a seesaw between bias and variance. If you are interested in inference from a model, then linear regression, lasso, subset selection have medium to high interpretability. If you are interested in prediction from a model, then methods like GAM, Bagging, Boosting, SVM approaches can be used. The authors give a good visual that summarizes this tradeoff between interpretability and prediction for various methodologies.
The following visual summarizes the position of various methods on bias variance axis.
When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. Like in the case of an algo trading model, it does not matter what the exact form of the model is as long as it is being able to predict the moves correctly. There is a third variation of learning besides the supervised and unsupervised learning. Semi supervised learning is where you have response variable data for a subset of your entire data but is not present for the rest of data.
The standard way to measure model fit is via mean squared error
This error is typically less on training dataset. One needs to check the statistic on the test data and choose that model that gives the least test MSE. There is no guarantee that the method that has the least error on the training data will also have the least error on the test data.
I think the following statement is the highlight of this chapter:
Irrespective of the method used, as the flexibility of the method increases, it results in a monotonically decreasing error for the training data and U shaped curve for the test data for the test data
Why do we see an inverted U curve? As more flexible methods are explored, variance will increase and bias will go down. As you start with more flexible methods, the bias starts to decrease more than the increase in variance. But beyond a point, the bias hardly changes and variance begins to increase. This is the reason for the U curve for the test MSE.
In the classification setting, the golden standard for evaluation is the conditional probability
for various classes. This simple classifier is called the Bayesian classifier. The Bayes classifier’s prediction is determined by the Bayes decision boundary. The Bayes classifier provides the lowest possible rate. Since the Bayes classifier will always choose the class for which the conditional probability is the largest, the error rate has a convenient form.The Bayes error rate is analogous to the irreducible error. This connection between Bayes error rate and irreducible error is something that I had never paid attention to. For real data, one does not know the conditional probability. Hence computing Bayesian classifier is impossible. In any case Bayes classifier stands as the gold standard against which to compare methods, for simulated data. If you think about the K nearest neighbor methods, as the number of neighbors increases, the bias goes up and variance goes down. For a neighbor of size 1, there is no bias but a lot of variance. Thus if you are looking for a parameter to characterize the increasing flexibility, then 1/K is a good choice. Typically you see a U curve for the test data and then you can zoom in on to the K value that yields minimum MSE.
Linear Regression
This chapter begins with a discussion about the poster boy of statistics, linear regression. Why go over linear regression? The chapter puts it aptly,
Linear regression as a good jumping off point for newer approaches.
This chapter teaches one to build a linear models with hardly mention on linear algebra. Right amount of intuition and right amount of R code drives home the lesson. A case in point is illustrating the case of spurious regression via this example:
Assuming that you are investigating shark attacks. You have two variables ice cream sales and temperature. If you regress shark attacks with ice cream sales, you might see a statistical relationship. If you regress shark attacks with temperature, you might see a statistical relationship. If you regress shark attacks with ice cream sales and temperature, then you might see ice cream sales variable being knocked off? Why is that?
Let’s say there are two correlated predictors and one response variable. If you regress the response variable with each of the individual predictors, the relationship might show a statistically significant relationship. However if you include both the predictors in a multiple regression model, one might see that only one of the variables is significant. In a multiple regression scenario, some of the correlated variables get knocked off for the simple reason that the effect of the predictor is measured keeping all else constant. If you want to estimate the coefficient of ice cream sales, you keep the temperature constant and check the relationship between the shark attacks and ice cream sales. You will immediately see that there is no relationship whatsoever. This is a nice example (for class room purpose) that drives home the point that one must be careful about correlation between predictors.
Complete orthogonality amongst predictors is an illusion. In reality, there will always be a correlation between predictors. What are assumptions behind a linear model?. Any simple linear models has two assumptions, one being additivity, the other being linearity. In additivity assumption, the effect of changes in a predictor on the response is independent of the values of the other predictors.In linearity assumption, the change in the response due to a oneunit change in is constant, regardless of the value of .
Why are prediction intervals wider than confidence intervals ? Something that is not always given a nice explanation in words. The chapter puts it nicely. Given a model you will be able to predict the response up to irreducible error. What this means is that even if you know the true population parameters, you model says that there is an irreducible error. You can’t escape this error. For a given set of covariates, one can form two sets of confidence bands. The first type of confidence bands is called confidence intervals where the bands are for training data. The prediction interval is wider than the confidence interval as it also includes the systematic error. Confidence interval tells what happens on an average whereas prediction interval tells what happens for a specific data point.
Basic inference related questions

Is at least one of the predictors useful in predicting response ? The classic way to test this do an anova so that it throws the relevant statistic. The summary table of the linear regression reports p values that are partialled out effects. So, in the contexts of fewer predictors, it is ok to see the pvalue and be happy with it. However if there are large number of predictors, by sheer randomness some of the coefficients can show statistical relationship. In all such cases, one needs to look at the F statistic.

Do all the predictors help explain
? Forward selection, Backward selection or Hybrid methods can be used.

How well does the model fit the data Mallows
, AIC, BIC, Adjusted
. These are all the metrics that give an estimate of test error based on training error

Given a set of predictors, what response values should we predict and how accurate is our prediction?
Clearly there are a ton of aspects that go wrong when you fit real life data, some of them are :

Non linearity of the responsepredictor relationships – Residual Vs Fitted plots might give an indication of this. transforming the independent variables can help.

Correlation of error terms: In this case the error matrix has a structure and hence you can use generalized least squares to fit a model. Successive terms can be correlated. This means that you have to estimate the correlation matrix if truly has that assumed correlation This is an iterative procedure. Either you can code it up or conveniently use nlme package.

Non constant variance of error terms – If you see a funnel shaped graph for variance of error terms, then a log transformation of response variable can be used

Outliers in the model bloat the RSE and hence all the interpretations of confidence intervals become questionable – Plot studentized residuals to cull out outliers

High Leverage points : Outliers are more relevant to unusual response variable values where leverage points are unusual covariate data points.There are standard diagnostic plots to identify these miscreants.

Collinearity : Because of collinearity, for similar value of RSS, different estimates of coefficients can be chosen. The joint distribution is a narrow valley and hence the parameter estimates are very unstable. This basically has to do with design matrix not being a full column rank.
might be almost singular matrix. Hence some of the eigen values of the matrix could be very close to 0. Hence the best way to understand collinearity is to compute variance inflation factor. faraway package has a function vif that automatically gives vif for all the predictors. Well , nothing much can be done to reduce the collinearity as the data has already been collected. All one can do is to drop the collinear column or take a linear combination of such columns and replace the columns by a single column.
The chapter ends with a section on KNN regression. KNN regression is compared with linear regression to comment on the non parametric vs. parametric modeling approaches. KNN regression works well if the true model is non linear. But if the number of dimensions increases, this method falters. Using the bias variance trade off graph for training and test data given in the chapter , one can see that KNN performs badly if the true data is linear and K is small. If K is very large and the true model is linear, then KNN method works as well as a parametric model.
The highlight of this chapter as well as this book is readily available visuals that are given to the reader. All you have to do is study them . If you start with ESL instead of this book, you have to not only know the ingredients but also code it up and check all the statements for yourself. This book makes life easy for the reader.
Classification
I sweated through the chapter on classification in ESL. May be that’s why, I found this chapter on classification a welcome recap of all the important points and in fact a pleasant experience to read through. The chapter starts off with illustrating the problem of using a linear regression for a categorical variable. Immediately it shows that a variation of linear model, i.e. logit model . However if the categorical variable has more than 2 levels, then the performance of multinomial logit starts to degrade.
The chapter gives a great example of paradox. The dataset used to illustrate this is the one that contains default rate, credit card balances, income and a categorical variable student. If you regress the default rate against student variable, you find that the variable is significant. If you regress default rate with balance and student variable, the student variable is kicked out. This is similar to shark attackstemperatureicecream sales case. There is an association between the balance and default rate. The fact that student comes up as a statistically significant variable is due to the fact that the student variable is correlated with balance variable. Here is a nice visual to illustrate confounding.
Visual on the left says at any given level the default rate of student is less than that of nonstudent. However average default rate of student is more than the average default rate of nonstudent. This is mainly because of the confounding variable. If you look at the visual on the right, the students typically carry a lot of balance and that is variable that is driving the default rate. The conclusions from this is : On an average students are risky than nonstudents if you don’t have any information. But given a specific balance, the default rate of nonstudents is more than students. I am not sure but I think this goes by the name of “ Simpson’s Paradox”
In cases where there are more than two levels for the categorical variable, it is better to go for a classification techniques. What’s the basic idea? You model the distribution of the predictors X separately in each of the response classes and then use Bayes to flip these around in to estimates for posterior probabilities of the response variable belonging to a specific class. LDA and QDA are two popular techniques for classification. LDA assumes a common covariance structure of the predictors for each of the separate levels of the response variable. It then uses Bayes to come up with a linear discriminant function to bucketize each observation. Why prefer LDA to Logistic in certain cases?

When the classes are wellseparated, the parameter estimates for the logistic regression model are surprisingly unstable. LDA does not suffer from this problem

If sample size is small and the distribution of the predictors is approximately normal in each of the classes, the LDA is again more stable than logistic regression model

LDA is popular when there are more than two responses
QDA assumes separate covariance matrix for each class and is useful when there is a lot of data and variance is not of great concern. In LDA, one assumes same covariance matrix and such an assumption is useful when the dataset is small and there is a price to pay for estimating parameters. If sample size is small LDA is a better bet than QDA. QDA is recommended if the training set is large, so that the variance of the classifier is not a major concern
There are six examples that illustrate the following points

When the true decision boundaries are linear, then LDA and logistic regression approaches will tend to perform well

When boundaries are moderately nonlinear, QDA may give better results

For much more complicated boundaries, a non parametric approach such as KNN can be superior. But the level of smoothness for non parametric method needs to be chosen carefully.

Ideally one can think of QDA as the middle ground between LDA and KNN type methods
The lab at the end of the chapter gives the reader R code and commentary to fit logit models, LDA model, QDA model, knn model. As ISL is a stepping stone for ESL, it is not surprising that many other aspects of classification are not mentioned. A curious reader can find regualarized discriminant analysis, connection between QDA and Fischer’s discriminant analysis and many more topics in ESL. However in the case of ESL, there is no spoon feeding. You have to sweat it out to understand. In ESL, there is a mention of an important finding. Given that LDA and QDA make such a harsh assumption on the nature of the data, it is surprising to know that they are two most popular practitioner’s methods.
Resampling Methods
This chapter talks about Resampling methods, the technique that got hooked me on to statistics. It is an indispensable tool in modern statistics. This chapter talks about two most commonly used resampling techniques : crossvalidation and bootstrap
Why crossvalidation ?

Model Assessment : It is used to estimate the test error associated with a given statistical learning method in order to evaluate its performance.

Model Selection : Used to select the proper level of flexibility. Let’s say you are using KNN method and you need to select the value of K. You can do a cross validation procedure to plot out the test error for various values of K and then choose the K that has the minimum test error rate.
The chapter discusses three methods, one being validation set approach, the second one being LOOCV( Leave one out cross validation ) and kfold cross validation method. In the first approach, a part of data set is set aside for testing purpose. Flexibility parameter is chosen in such a way that MSE is minimum for that level of flexibility.
What’s the problem with Validation set approach?

There are two problems with validation set approach. The test error estimate has a high variability as it depends on how you choose your training and test data. Also in the validation set approach, only part of the data is used to train the model. Since the statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate.

LOOCV has some major advantages. It has less bias. It uses almost the entire data set. It does not overestimate the test error rate. With least square or polynomial regression, LOOCV has an amazing formula that cuts down the computational time. It is directly related to hat values and residual values. But there is no readymade formula for other models and thus LOOCV can be computationally expensive. As an aside if you use LOOCV to choose the bandwidth in Kernel density estimation procedure, all the hard work of that FFT does is washed away. It is no longer O(n log n).

k fold CV is far better from a computational stand point. It has some bias as compared LOOCV but it has lesser variance that LOOCV based estimate. So, in the gamut of less bias more variance more bias less variance, a 5 fold CV or a 10 fold CV has been shown to yield test error rates that do not suffer from excessively high bias or high variance.
Most of the times we does not have the true error rates. Nor do we have the Bayesian boundary. We are testing out models with various flexibility parameters. One way to choose the flexibility parameter is to estimate test error rate using k fold CV and choose the flexibility parameter that has the lowest test error rate. As I have mentioned earlier, the visuals are the highlight of this book.
The following visual shows the variance of test error estimate via validation set method
The following visual shows the variance of test error estimate via k fold CV. The variance of estimate is definitely less than validation method
The following visual shows the selecting a flexibility parameter based on LOOCV or k fold CV will yield almost similar results
The chapter ends with an introduction to bootstrap, a method that samples data from the original data set with replacement and repeatedly computes the desired parameters. This results in a realization of the parameter sample from which standard error of the parameter can be computed.
I learnt an important aspect of bootstrapping. The probability that a specific data point is not picked up in bootstrapping is 1/e. This means that even though you are bootstrapping away to glory, there are 1/3 of the observations that are not touched. This also means that when you try to build a model on one bootstrapped sample, you can actually test that model on the 1/3 of observations that haven’t been picked up. In fact this is exactly the idea that is used in bagging, a method to build regression trees. Well, you can manually write the code for doing cross validation, be it LOOCV or k fold CV. But R gives all of them for free. The lab mentioned at the end of the chapter talks about boot package that has many functions that allow you do to cross validation directly. In fact after you have worked with some R packages , you will realize that most of the packages have the cross validation argument in the modeling functions as it allows the model to pick the flexibility parameter based on k fold cross validation
After going through R code from the lab, I learnt that it is better to use subset argument in building a model on a subset of data. I was following a painful process of actually splitting the data in to test and trained sample. Instead one can put in the index of observations that are meant for training data, and model will be built on those observations only. Neat!
Linear Model Selection and Regularization
Alternative fitting procedures to least squares can yield better prediction accuracy and model interpretability. If there is no unique least squares coefficient. However shrinking the estimated coefficients, one can substantially reduce the variance of the coefficients, at the cost of a negligible increase in bias. Also least squares usually fits all the predictors. There are methods that help in the feature or variable selection.
There are three alternatives to least squares that are discussed in the book.
In the Best subset selection, you choose amongst models. Since this is computationally expensive, one can use forward / backward or hybrid methods. The number of models fitted in a forward / backward regression model is . How does one choose the optimal model? There are two ways to think about this. Since all the models are built on a dataset and that’s all we have, we can adjust the training error rate as it usually a poor estimate of test error rate. So, there are methods like Mallows , AIC, BIC, Adjusted . All these results are obtained via regsubsets function from the leaps package.
The other method is via Validation and Cross validation. Validation method involved randomly splitting the data and training the data on one set and testing the trained model on the test dataset. The model that gives the least test error is chosen. K fold cross validation method means splitting the data in to k segments, training the model using k1 segments and testing it out on the held out segment and redoing this k times and then averaging out the error. Both these methods give a direct estimate of the true error and do not make distributional assumptions. Hence these are preferred over methods that adjust training error. An alternative to best subset selection is a class of method that penalizes the values the predictor coefficients can take. The right term to use is “regularizes the coefficient estimates”. There are two other methods discussed in the text, one being ridge regression and other being lasso. The penalizing function is an penalty for ridge regression whereas it is penalty for lasso.
The math behind the three types of regression can be succinctly written as
Ridge regression
lasso regression
Best subset regression
Some basic points to keep in mind

One should always standardize the data. For least square the estimates are scale invariant. But ridge regression and lasso are dependent on
and the scaling of the predictor.

Ridge regression works best where the least squares estimate have high variance. In Ridge regression none of the coefficients have exactly 0 value. In lasso, the coefficients are taken all the way to 0. Lasso is more interpretable than Ridge regression.

Neither Ridge nor Lasso universally dominate the other. It all depends on the true model. If there are small number of predictors and the rest of predictors have small coefficients lasso performs better than ridge. If the modeling function is dependent on many predictors, then ridge performs better than lasso.

The best way to remember lasso and ridge is that lasso shrinks the coefficients by the same amount whereas ridge shrinks the coefficients by the same proportion. In either case the best way to choose the regularization parameter is via cross validation.

There is Bayes angle to the whole discussion on lasso and ridge. If you think from a Bayesian point of view, lasso is like assuming a Laplace prior for betas where ridge is like assuming a Gaussian prior for betas
The other method that is discussed in the chapter fall under dimension reduction methods. The first such method is Principal component regression. In this method, you create various linear combination of predictors in such a way that each of the components has maximum variance of the data. All dimension reduction methods work in two steps. First, the transformed predictors are obtained. Second, the model is fit using these predictors. There are two ways to interpreting PCA directions. One is the direction along which the data shows the maximum variance. The second way to interpret is the directions along which the data points are closest.
What does this visual say ? By choosing the right direction, one gets the predictors having maximum variance in this direction. This means the coefficients are going to be more stable if there is maximum variance shown by the predictor. The standard error of is . In all the cases when predictors are close, the standard errors are bloated. By choosing the directions along the principal components, standard error of is low.
Observe these visuals
The first component captures most of the information that is present in the two variables. The second principal component has nothing much to contribute. By regressing using these principal components, the standard errors of betas are less,i.e., the coefficients are stable. Despite having gone through Partial least squares regression in some of the previous texts, I think the following visual made all come together for me.
This shows the PLS based regression chooses a slightly tilted direction as compared to PCR. The reason being, the component directions are chosen in such a way that the response variable is also taken in to consideration. The best way to verbalize PCLS is that it is a “supervised alternative” to PCR. The chapter gives the basic algo to perform PLS
Towards the end of the chapter, the book talks about high dimensional data where . Data sets containing more features than observations are often referred to as highdimensional. In such a dataset, the model might overfit. Imagine fitting a linear regression model with 2 predictors and 2 observations. It will be a perfect fit and hence residuals will be 0 and the will be 100%. When , a simple least squares regression line is too flexible and hence overfits the data.
Your training data might show that one should use all the predictors, but all it says that it is just fitting one model out of a whole lot of models to describe the data. How to deal with high dimensional data? Forward stepwise selection, ridge regression, lasso, and principal component regression are particularly useful for performing regression in the high dimensional setting. Despite using these three methods, there are three things that need to be kept in mind

Regularization or shrinkage plays a key role in highdimensional problems

Appropriate tuning parameters selection is crucial for good predictive performance

The test error tends to increase as the dimensionality of the problem increases, unless the additional features are truly associated with the response
Another important thing to be kept in mind is in reporting the results of highdimensional data analysis. Showing training is a nono, as it does not paint the right picture. Instead it is better to report the cross validation error or results on an independent data set.
Moving Beyond Linearity
In all the models considered in the previous chapters, the form has been a linear form. One can think of extending the linear model and here are a few variants
 Polynomial regression : This extends the linear model by adding powers of predictors
 Step functions cuts the range of variable in to K distinct regions in order to produce a qualitative variable
 Regression splines are more flexible than the above two. They involve dividing the range of X in to K distinct regions. Within each region, a polynomial function is fit to the data. However these polynomials are constrained so that they join smoothly at the region boundaries, called knots. Splines are popular because human eye cannot identify the discontinuity at the knots.
The following gives a good idea about regression splines
In the top left , you specify unconstrained cubic polynomials. In the top right, you impose a constraint of continuity at the knot. In the bottom left you impose a constraint on continuity, the first derivative and the second derivative. A cubic spline uses a total of degrees of freedom. A degreed spline, is one that is piecewise degreed polynomial, with continuity in derivatives up to degree d1 at each knot. Fitting regression spline can be made easy by fitting a model with truncated power basis function per knot.
To cut the variance of splines, one can use another constraint at the boundary that results in natural splines.
What’s the flip side of Local Regression ? Local regression can perform poorly if number of predictors is more than 3 or 4. This is the same problem that KNN suffers from in the case of high dimensionality.
The chapter ends with GAM (Generalized Additive Models) that provide a general framework for extending a standard linear model by allowing nonlinear functions of each of the variables, while maintaining additivity.
There are a ton of functions mentioned in the lab section that equips a reader to estimate the parameters of polynomial regression models, cubic splines, natural spline, smoothing spline, local regression etc..
Tree Based Methods
What’s the basic idea? Trees involve stratifying or segmenting the predictor space in to a number of simple regions. In order to make a prediction, one typically uses the mean or the mode of the training observations in the region to which it belongs. Tree based methods are simple and useful for interpretation. However they are not competitive with the best supervised learning approaches. Hence a better alternative is creating multiple trees and then combining all of them in one big tree.
The chapter starts with Decision trees. These can be applied to both regression and classification tasks. In regression task, the response variable is continuous whereas in classification, the response variable is discrete. In the case of regression trees, the predictor space is split into distinct and nonoverlapping regions . The goal is to find boxes such that
where the mean response rate within the jth box. To make it computationally practical, a recursive binary splitting method is followed. A topdown greedy algorithm is used. At each step the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step. The problem with this approach is that the fitted tree could be too complex and hence it might perform poorly on the test data set. To get over this problem, one of the approaches is to build the tree and then prune the tree. This is called Cost complexity pruning. The way this is done is similar to lasso regression.
Classification tree is similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative response. Since one is not only interested in the class prediction for a particular terminal node, but also the class proportions among the training observations that fall in to the region. RSS cannot be used. What one needs to use classification error rate. The classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class. There are two other metrics used for splitting purposes (measures of node purity)

Gini Index

Crossentropy index
The key idea from this section is this :
One can fit a linear regression model or a regression tree. The relative performance of one over the other depends on the underlying true model structure. If the true model is indeed linear, then linear regression is much better. But if the predictor space is non linear, it is better to go with regression tree. Even though trees seem convenient for a lot of reasons like communication, display purposes etc., the predictive capability of trees is not all that great as compared to regression methods.
The chapter talks about bagging, random forests and boosting, each of which are dealt at length in ESL. Sometimes I wonder whether it is better to understand a topic in depth before getting an overview of it or the other way round. The problem with decision trees is that it is has high variance. If the training and test data are split in a different way, a different tree is built. To cut down the variance, bootstrapping is used. You bootstrap the data, create a tree and aggregate against all the bootstrapped trees.
There is a nice little fact mentioned in one of the previous chapters that says that the probability of a data point that is not selected in a bootstrapping exercise approaches e. Hence for every bootstrap, there are 1/3^{rd} of the training points that are not selected for tree formation. Once the tree is formed for that bootstrapped sample, these Out of bag sample can be tested and over multiple bootstraps, you can form an error estimate based for every observation. A tweak on bagging that decreases the correlation between the bootstrapped trees is random forests. You select at random a set of parameters; build a tree and then average over many such boot strap samples. The number of predictors chosen is typically square root of the number of parameters. The chapter also explains Boosting that is similar to bagging but learns slowly.
Unsupervised Learning
This chapter talks about PCA and clustering methods. At this point in the book, i.e. towards the end of the book, I felt that it is better to sweat it out with ESL and subsequently go over this book. The more one tries to sweat it out with ESL , the more ISL will turn out to be a pleasant read.. May be I am biased here. May be the right way could ISL to ESL. Whatever be the sequence, I think this book will serve as an apt companion to ESL for years to come.
Matrices are everywhere in statistics. Every linear model involves some kind of a generalized inverse calculation of the design matrix. Statisticians, data analysts etc. are always dealing with datasets that might have repeated measurements, repeated observations, noise etc.
The matrix is never full row rank matrix. This means that the traditional inverse applicable for a square matrix has to give way to something less restrictive and hence a generalized inverse is needed. is not unique. If you find one, you can manufacture a ton of other solutions to the set of equations. In some of the texts on linear algebra course this is often introduced as Pseudo inverse , the name implying that it is not exactly an inverse. The generalized inverse arises from a matrix that satisfies the first of the four MoorePenrose conditions.
One can compute an inverse of a square matrix in a number of ways. However computing is tricky. One procedure that I knew before reading this book is the SVD. Factorize the matrix using singular value decomposition and compute the pseudo inverse. This book gives a nice algorithm to compute G that one can use it to work out for smaller dimensional matrices and get a feel of the various solutions possible. Why work on smaller dimensional matrices to compute ? Well, as they say we do not learn to solve quadratic equations by working with something like just because it occurs in real life. Learning to first solve is far more instructive. Similarly working with a tractable algorithm that computes gives the understanding to use more sophisticated algos. Multiple regression model estimation depends on the normal equations .To solve the set of equations, one needs to know the generalized inverse of Ok, enough about as this book is not about a specific matrix. The book focuses on matrix theory and not on computations. So, all you need to work through this book is pen and paper.
The book was published in 1982 and here is the blurb about the author from wiki
Shayle Robert Searle Ph.D. (April 26, 1928 – February 18, 2013) was a New Zealand mathematician who was Professor Emeritus of Biological Statistics at Cornell University. He was a leader in the field of linear and mixed models in statistics, and published widely on the topics of linear models, mixed models, and variance component estimation. Searle was one of the first statisticians to use matrix algebra in statistical methodology, and was an early proponent of the use of applied statistical techniques in animal breeding.
The author does not talk about statistics right away in the book. The book has about 350 pages of theory before any mention statistical applications. For those looking for some crash course kind of thing in linear algebra to statistics, this book is nono. I was in that category few years ago and slowly I realized that it is not satisfying to have shallow knowledge. Well, may be nobody in the world cares if you know the derivation of the distribution of a quadratic form. However spending time understanding such things gives a nice sense of comforting feeling that is hard to express in words. For me, there is some sense of satisfaction once I understand stuff at a certain depth. For example, I saw the ANOVA output much before I could understand that exactly going behind it. In a typical undergrad course or even in some grad courses, that’s how things are introduced and explained. At least that’s what happened in my case. But much later in life I learnt that each line of ANOVA output is nothing but the projection on to orthogonal components of the predictor subspace and the lines are a fancy way of representing a version of Pythagorean Theorem. Well, that’s not exactly the point of ANOVA though. Splitting the norm square in to orthogonal components is not useful for inference unless one makes a few assumptions on the distribution and then the whole machinery rocks.
The author mentions in the preface that the only prerequisite to this book is high school algebra. Indeed after reading the book I think he is spot on. He uses differentiation at a few places but those pages are prefixed with author’s words that it’s ok to skip those pages. This is one of those rare books that take the reader from explaining basic stuff like “what the matrix looks like” to using it in a “statistical inference” of a linear model.
Many years ago I was told that Axler’s book on Linear Algebra is the one that needs to be read to get a good understanding. I followed the advice and realized that time that every matrix is nothing but a linear transformation. There is an input basis, there is an output basis and a transformation is represented by the matrix. However books on statistics use special matrices and operators. You need to connect ideas in matrix theory and statistics to understand stats well. In that sense this book is an awesome effort. In 12 chapters out of the 15 chapters, the author explains the various concepts that are used in stats. When a reader reaches last three chapters, more often than not, he/she is thirsting for more and I guess that’s a sign of a great book. Some of the equations you had always known appear to twinkle and shine brightly.
The book starts with a basic introduction and talks about the need for working with matrices. If you think about it, whatever field that you are working in, using matrices, you can state the problem independent of the size of the problem. It is also useful to organize computer techniques to execute those methods and to present the results. The chapter is basically an intro to the notations used throughout the book.
The second chapter gives a list of basic matrix operations. In doing so, the author introduces partitioning, transposing, inner product, outer product etc. The reader is encouraged to view matrix multiplication as a linear combination of rows and linear combination of columns. When you see AB, the product matrix, the first thing that I guess that should come to one’s mind is that rows of the AB are linear combination of rows of B and columns of AB are linear combination of the columns of A. Similarly premultiplying a matrix by a diagonal matrix is equivalent to scaling each of the rows by the diagonal entries. Post multiplying means scaling each column. This view point is crucial for understanding various canonical forms of a matrix. By using elementary row operations and column operations, you can transform a matrix in to a bare bone structure that is very powerful. If the operations are chosen well, this structure is a diagonal matrix from which one can cull a ton of information. I think if you work with R for some time, this kind of thinking becomes natural as every line of code you write in R, you are either using vectorized functions or building vectorized code.
The third chapter deals with special matrices, special because of their properties and also because of their appearance in statistics. Well, what are they?

Symmetric matrices: Covariance matrix, Correlation matrix,
,
are all the standard mathematical objects in statistics and these are all symmetric matrices.

All elements equal: Denoted by
this matrix is crucial for various operations like centering the data. If you have used the
sweep function in R, the background of such an operation typically involves this matrix.

Idempotent matrices: In stats, one comes across them in various places. If you project dependent data vector on to the subspace orthogonal to input vectors, you get a projection matrix that is idempotent. I think this is a standard place where one comes across idempotent matrix. Another place where it appears is in quadratic forms. Every line of output in ANOVA is attached a distribution. This is done by computing the distribution of the quadratic form and idempotent matrices are crucial in proving many aspects.

Orthogonal matrices – Well, a data analyst should make long lasting friendship with these matrices. In a world filled with noise, these friends help you see the data in the right way. Principal component analysis, Factor analysis, ridge regression, lasso, LARS all use orthogonal matrices in one form or the other. They are everywhere. Helmert, Givens transformation and Householder matrices are some of the special orthogonal matrices mentioned in the book.

Positive definite and Positive semi definite: If you have gone through the book by Gilbert Strang on Linear algebra, he makes it a point to highlight these matrices everywhere. Well rightly so, they do appear everywhere in statistics. Depending on the nature of the matrix, the scalar multiplication can be thought of as a quadratic or bilinear form.
The fourth chapter is on determinants. It is not so much about the computation of determinant that matters. Since Gaussian multivariate distribution, the most common distribution that one sees, has a determinant in its functional form, it is better to know some of the basic properties. More importantly one must quickly be able to verbalize the intuition behind a few things like, a determinant of an orthogonal matrix is either 1 or 1, and the determinant of idempotent matrix is either 0 or 1. Understanding the computation determinant from a cofactor viewpoint helps one connect the ideas with eigen value decomposition. i.e. Sum of eigen values is the trace of a matrix and Product of eigen values is the determinant. The fifth chapter deals with computing inverse using determinant and cofactor method.
The sixth chapter of the book is on “Rank’” of a matrix, a concept that is a defining characteristic of a matrix. Terms such as column rank, full column rank, row rank, full row rank, and full rank are introduced with a few examples. The idea of rank is connected with the idea of linear dependence between rows or columns of a matrix. I think Strang’s visual of four subspaces is wonderful in understanding the concept of rank. If there is a vector that gets transformed to a null space, then the matrix is a deficient matrix, i.e. column rank is not equal to row rank. No amount of theorems and lemmas can give the same effect as that of the “four subspaces visual”.
I think the highlight of this chapter is to show the full rank factorization of a matrix. This is also probably the first signs of the author hinting at the power of partitioning a matrix.
The seventh chapter is on canonical forms. There are various canonical forms mentioned in the chapter. All of them are fancy names for transforming the matrix in to a diagonal form by premultiplying and postmultiplying by elementary matrices. The first canonical form mentioned is the equivalent canonical form. This exists for every matrix. Take any matrix, apply relevant operations to its rows and columns, you can manufacture a diagonal matrix. This diagonal matrix is very special as it immediately tells you the rank of the matrix.
Why is this canonical form useful? Well, firstly it will help one compare the rank amongst matrices. Secondly this form readily tells us whether the matrix is full column rank or full row rank or full rank matrices. All these are crucial for the existence of right/left inverses for the matrix. For symmetric matrices, there is another canonical form that goes by the name congruent canonical form. There is also a canonical form under similarity that comes up in the chapter on eigen values. I think the highlight of this chapter is the proof that a matrix of rank r can always be factored in two matrices one with full column rank and one with full row rank.
The eighth chapter is on generalized inverses. Well, I have said enough about it at the very beginning of this post. All I can say is that this chapter gives you a firm grounding on generalized inverses.
The ninth chapter is on solving linear equations. The first concept introduced in the chapter is that of consistency. Any relationship in the LHS of a linear equation must be present in the RHS. Else, the equations are inconsistent. This chapter is one of the crucial chapters in the book that talks about the usage of generalized inverse to solve a set of linear equations. Once you find a generalized solution, you can manufacture any number of solutions based on the specific solution. So, the solution to the consistent equations, , with having columns and being a generalized inverse of , is given by
for any arbitrary vector of order . This means that you can write down the solution to as
for any arbitrary vector of order . From here on, the book touches upon so many ideas that are so crucial in understanding a linear model.
The tenth chapter is on partitioning. If you have been told that conditional density in a multivariate normal distribution is also a multivariate normal and you wonder why, then partitioning is one way to understand it. As an aside, if you are wondering what’s the big deal in partitioning a matrix, I think you can try this: Derive the conditional density of the variables given in the case where all the four variables are from a multivariate normal If you can derive with ease, then I think you are terrific. But if you are like me, rusty with such a derivation, then you might want to pay attention to partition matrices. Where do partition matrices help? Firstly, if you can spot orthogonal matrices in a big matrix, you are better off partitioning it to exploit the orthogonality features. Computing determinants is quicker once you partition. I think the biggest learning from me in this chapter was the way to look at inverse of a matrix. A matrix inversion formula using partitioned matrices is very useful in whole lot of places. These matrices inversion formulas are written using Schur complements. The chapter ends with a discussion on direct sum and product sum operators on a matrix. Well, kronecker delta is the fancy name for product sum. As an aside I have always replicated matrices as blocks in a sparse matrix to aid quick calculations with R. This replication makes a lot of sense when I realize that it is basically a product sum operator on the design matrix.
The next chapter in the book is on eigen values and eigen vectors. These are special vectors of a matrix that give rise to invariant subspaces. In goes a vector, out comes a vector that is merely scaled version of the vector, i.e. the transformation gives rise to an invariant subspace. The chapter gives some examples to show the computation of eigen vectors and eigen values. The highlight of the chapter is the diagonability theorem that gives the conditions for a matrix to be represented in the form . For a symmetric matrix, the eigen values are all real, eigen vectors are orthogonal, and they can be represented in a form, that goes by the name, canonical form under orthogonal similarity
The twelfth chapter deals with a set of miscellaneous topics that are all so important. In one sense by this point a reader is overwhelmed by the beauty of this subject. There are so many key ideas connecting linear algebra and statistics mentioned here. This chapter gives a brief summary of orthogonal matrices, idempotent matrices, the important matrix , summary of canonical forms, differential forms of various vector operators, vec and vech operators, Jacobian matrix, Hessian matrix etc.
After 350 pages of the book, the author starts talking about applications in statistics. So, the reader’s effort in going through the first 12 chapters is rewarded in the last 80 pages of the book. Chapter 13 explains about variance covariance matrix, correlation matrices, centering the matrix, Wishart matrix, multivariate normal distribution and quadratic forms. The last two chapters rip apart the linear model and build everything from scratch using the concepts from introduced in the book.