May 2012



Bradley Efron


Robert J. Tibshirani


When I first encountered the bootstrap algorithm and looked at its applications, I was literally blown away. Here was a methodology that took traditional statistics head-on. Armed with a computer and the basic bootstrap algorithm, you can do pretty much any sort of analysis that you find in the traditional statistical world. Guess what, you hardly need to remember any complicated formulae. Somehow, people who are used to the traditional way of doing statistics do not like bootstrapping, for various reasons. This is from my personal experience. I had developed a model that involved bootstrapping, and it was a decent way to handle uncertainty. However, the model recently went through a review and the reviewers were averse to using bootstrapping. Well, I had to go with their decision, because sometimes it is better to get things done than to be in a perpetual state of debate. In any case, it was unlikely that I would have convinced them about the superiority of a bootstrapped approach. Having said that, my romance with bootstrapping is steady :). I love the way things can be done using the basic bootstrapping algorithm. In fact, these days whenever I end up solving something in a parametric way, I always double check whether the bootstrap gives similar results. If the results don’t match, I tend to cast a suspicious eye on the parametric method and trust the bootstrap method. There are very few books written on the bootstrap, maybe a handful, even though the basic idea was introduced about 30 years ago.

This book is a masterpiece on bootstrap methods. What does this book contain? To answer that question, one must at least have a rough map of the various methodologies. This visual gives a nice overview of what’s in store for any reader of this book. If you look at the lower right quadrant, almost all the methods that are taught in conventional statistics belong to this quadrant. You estimate something, you appeal to asymptotics, you make some parametric assumption and then you crank out stuff to do modeling. Take any stats-oriented course at a random undergraduate / graduate level: typically the focus is on the lower right quadrant, which I think is a little sad because there are 3 other quadrants which warrant good coverage. I think once a student is exposed to nonparametric stats, it is unlikely that they will ever go back and trust parametric stuff. At least that’s what happened to me.


The book starts off by emphasizing the computing power that has made possible nonparametric methods that were unthinkable before. The purpose of the book is to explain when and why bootstrap methods work, and how they can be applied in a wide variety of real data-analysis situations.

“Bootstrap” etymology – The use of the term bootstrap derives from the phrase “to pull oneself up by one’s bootstraps”, widely thought to be based on one of the eighteenth century Adventures of Baron Munchausen, by Rudolph Erich Raspe. The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.

In the context of statistics, it means you use the data itself to make inferences about the population it came from, without resorting to distributional assumptions. It is a computer-based method for assigning measures of accuracy to statistical estimates.

What are the typical measures of accuracy of statistical estimates?

  • Standard error
  • Bias
  • Prediction error
  • Confidence interval

I was pretty comfortable with the first two aspects. I went through this book to learn about a particular method of computing prediction error called “cross-validation”. In a lot of nonparametric methods, cross-validation is a go-to procedure. Take the simple example of a histogram. At some point or the other, one typically comes across a histogram. If you are given pen and paper and a set of 50 data points, and are asked to come up with a histogram, it takes some understanding and thought before you arrive at one. These days any software like Excel can make a histogram. The problem with point-and-click interfaces is that the learning is superficial. I know I am taking a diversion here from the intent of the post, but I can’t help it. The way you choose to divide the data into bins in a histogram is critical for a good estimate. How do you choose the bin width? There is some theory behind it, and one of the methods covered in this book is used in the estimation. So before you think that the bootstrap is only for complicated problems, the fact that it is used in something as common as a histogram says that the theory has become very powerful and is used in a wide variety of applications.

Ok, coming back to the book.

Chapter 1 – Chapter 5

The book starts off giving some basic fundas on probability and statistical inference. It then devotes a chapter to the empirical distribution function, the SINGLE MOST IMPORTANT ESTIMATE in the context of bootstrapping. For a long time I did not understand that sampling with replacement has a fundamental connection with the empirical distribution function: drawing n points with replacement from the data is the same as drawing an i.i.d. sample from the empirical distribution. The book then talks about the plug-in principle that makes bootstrapping possible in the first place. The plug-in principle is fantastic if you have no clue about the underlying distribution, which is often the case. Instead of assuming some convenient distribution, I would always go with the plug-in principle. The book has an introductory chapter that explains the standard error of an estimate and the estimated standard error of an estimate. Thus the first five chapters of the book serve as good preparation / refresher for the reader, who is then equipped to read and work through the rest of the chapters. Let me attempt to summarize the chapters that I have managed to go over. Before that, I must say that the R package “bootstrap”, which contains all the datasets used in the book, has been of immense help. I guess reading and working through stats books has completely changed post the R revolution. In this case, you can get all the datasets and the code associated with various results in the book. This has helped me to DO and WORK OUT the various concepts mentioned rather than merely READING up.

Chapter 6: The bootstrap estimate of standard error

We have a random sample x = (x1, x2, ..., xn) from an unknown probability distribution F, and let’s say we wish to estimate a parameter of interest theta = t(F). Let’s say we have an estimate of theta, call it theta.hat. One can compute this estimate by any method; a simple way would be the plug-in estimate. Obviously this is a point estimate, and one needs to have some understanding of the uncertainty around it. This is where bootstrapping comes into the picture. It is a good procedure for computing the standard error of the estimate. The bootstrap estimate of standard error requires no theoretical computations, and is available no matter how mathematically complicated the estimator is.

How does this work?

Firstly, you need some terminology to get going:

  • Bootstrap sample / resampled version – A sample of the same size drawn with replacement from the realized sample
  • Bootstrap replication of theta – The statistical functional theta computed on the resampled data
  • Ideal bootstrap estimate of standard error of theta – Standard deviation of the bootstrap replications of theta in the limit of infinitely many bootstrap samples
  • Sample bootstrap estimate of standard error of theta – Sample standard deviation of a finite number, B, of bootstrap replications

If the data has n points, you generate a sample of n points with replacement, called the resampled version, and compute the estimate on it, called a bootstrap replication of theta. If you could do this infinitely many times, the standard deviation of the replications would be the ideal bootstrap estimate of the standard error of theta. In reality, one resamples the data B times, with B around 200, and then calculates the sample standard deviation of the B replications. Why are 200 bootstrap samples good enough? Obviously, with only 200 replications there is more variability than with infinite replications. But how much more? I had this question for many years.

Well, to understand where to stop, one needs to look at the coefficient of variation of the standard error estimate. The book gives a formula that connects the coefficient of variation for B replications with that for infinite replications. As the number of replications goes up, the bootstrap estimate of the standard error of theta comes closer to the ideal bootstrap estimate.
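To make the "how many replications" question concrete, here is a minimal sketch in plain Python (standard library only, not code from the book). The data set is made up and the helper name bootstrap_se is hypothetical. It reruns the whole bootstrap several times for each B and reports how much the standard error estimate itself moves around because B is finite.

```python
import random
import statistics


def bootstrap_se(data, stat, B, rng):
    """Sample standard deviation of B bootstrap replications of stat."""
    n = len(data)
    # Each replication: resample n points with replacement, recompute stat.
    reps = [stat(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.stdev(reps)


rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(30)]  # made-up sample, n = 30

# For each B, rerun the bootstrap 30 times and measure how much the
# standard error estimate varies (its coefficient of variation).
spread = {}
for B in (25, 200, 800):
    ses = [bootstrap_se(data, statistics.fmean, B, rng) for _ in range(30)]
    spread[B] = statistics.stdev(ses) / statistics.fmean(ses)
    print(f"B={B:4d}  cv of the se estimate = {spread[B]:.3f}")
```

The spread measured here comes only from resampling, since the data set is held fixed. The sampling variability of the original n points is a separate source of error that no amount of extra replications removes.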

One can also do a parametric bootstrap where instead of sampling with replacement from the realized data, you sample from the parametric distribution that is assumed for the realized data. When you write down stuff, you understand things much better, at least that holds for me. For some people when they talk they understand it better. In my case, if I write down stuff, I remember for a longer time. That’s the reason for writing these summaries. Also I hope that some of my blog readers benefit by reading my thoughts and observations.

Let’s say you have 15 data points of x and y, and you want to calculate the standard error of the correlation between x and y. In the nonparametric bootstrap case, you sample the 15 pairs with replacement to generate a bootstrapped sample, calculate its correlation, repeat this B times, and take the standard deviation of the B correlations as the standard error. In the case of the parametric bootstrap, you assume that the data comes from a bivariate normal, estimate its mean vector and covariance matrix, generate B bootstrap samples from this fitted distribution, compute the correlation for each, and again take the standard deviation.

To calculate the standard error of the correlation estimate, one can do a nonparametric bootstrap (dataset here is a 15 x 2 matrix or data frame)

sd(replicate(200, cor(dataset[sample(1:15, 15, replace = TRUE), ])[1, 2]))

Or one can do a parametric bootstrap (rmvnorm comes from the mvtnorm package)

library(mvtnorm)
sd(replicate(200, cor(rmvnorm(15, colMeans(dataset), cov(dataset)))[1, 2]))

The former works with the empirical distribution, whereas the latter works with an estimated parametric distribution, i.e. a bivariate normal.

Something that struck me after reading this chapter: why do textbooks assume distributions? Because most of them want to derive some closed-form solution. They cannot say that you iterate / bootstrap the data and get an estimate. Since you cannot otherwise write down a standard error for whatever estimate you are working on, the usual resort is to apply asymptotics and say that as n approaches infinity, the estimator approaches a well-known distribution such as the normal, Student-t, beta etc., and then use that to produce a nice-looking formula for the standard error of the estimate. However, there is always a problem in relying on such standard errors, as the limiting distribution is an abstraction given the finite data available to you.

The bootstrap has two advantages over traditional textbook methods

  1. When used in nonparametric mode, it relieves the analyst from having to make parametric assumptions about the form of the underlying population
  2. When used in parametric mode, it provides more accurate answers than textbook formulas, and can provide answers in problems for which no textbook formulae exist

There is a nice exercise towards the end of the chapter about computing the standard error of a trimmed mean. The good thing about this example is that it shows you can work with sampling without replacement also. You sample without replacement and generate unique combinations of data, assign a probability to each of those samples, and calculate the standard error of the trimmed mean. It is computationally intensive, as it requires that you generate all permutations of the data. In fact, when I did this using R, it took a painfully long time. So I switched to Python and obtained the ideal bootstrap estimate of standard error. Besides teaching me some stats, this exercise made me realize that in the future I will be using more and more Python to code up models.

Chapter 7: Bootstrap Standard errors: Some examples

Think of any estimate, be it a simple statistical functional or a complicated function: there are parameters to be estimated, and the first question anyone would be interested in is, “What is the standard error of the estimate?”. This chapter talks about the computation of the standard error of the estimate for 2 specific cases, one involving eigenvectors and the second involving loess.

One of the basic techniques for working with multivariate data is PCA, which involves an eigenvalue decomposition. Once you crunch the numbers and get the eigenvalues and eigenvectors, some of the questions of interest are, “What is the proportion of variance explained by the first eigenvector?” and “What is the variability of the eigenvectors?” One can use the same bootstrapping algo to get a standard error for the variance explained by the first eigenvector and standard errors for the eigenvector components. Now why would one be interested in such a thing? If you see that the eigenvector components show high variability, then it is likely that there are some extreme data points that are influencing the analysis. This type of analysis is painstaking and often erroneous if you rely on standard parametric statistics. With bootstrap methods, all you need is a computer.

The second example is about loess applied to about 82 data points where a simple linear or quadratic regression does not capture the correct relationship. Using loess instead, one can see that the bias goes down and the variability goes up.

The chapter ends with a discussion of the cases where the bootstrap fails. Whenever the empirical distribution function is not a good estimate of the relevant feature of the true distribution, everything goes haywire. Using a simple example, the MLE of the upper end point of a uniform distribution (the sample maximum), the section shows that bootstrapping fails spectacularly. So one must be careful in using the bootstrap too.
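A small sketch of that failure, in plain Python with made-up data (variable names are hypothetical, not the book's code): for a sample from U(0, theta) the MLE of theta is the sample maximum, and a large share of the bootstrap replications simply reproduce that maximum, so the bootstrap distribution is a poor stand-in for the real sampling distribution.

```python
import random

rng = random.Random(1)
n, theta = 50, 1.0
sample = [rng.uniform(0, theta) for _ in range(n)]  # made-up data
mle = max(sample)  # MLE of the upper end point of U(0, theta)

B = 2000
reps = [max(rng.choices(sample, k=n)) for _ in range(B)]

# A resample contains the sample maximum with probability
# 1 - (1 - 1/n)^n, roughly 0.63, so most replications equal the MLE exactly.
frac_at_mle = sum(r == mle for r in reps) / B
print("fraction of replications equal to the MLE:", frac_at_mle)
print("theoretical value:", 1 - (1 - 1 / n) ** n)
```

No replication can ever exceed the observed maximum, whereas the true theta always lies above it, which is exactly the part of the distribution the resampling cannot see.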

Chapter 8: More Complicated Data Structures

This chapter moves on to talking about bootstrapping in the context of multiple samples. The basic algo is the same but the data structure that is used in the algo needs a little tweak as well as the approach. In the case of multiple sample data, instead of simulating from a single empirical distribution function, you simulate data from a probability function that is dependent on the empirical distribution of its components.

There is a section on using bootstrapping to compute the standard error of the coefficients in AR(1) and AR(2) models. The idea again is simple to understand. The residuals of the fitted model are the realized sample that one has. You sample with replacement from these residuals, construct new time series from them, and re-estimate the coefficients. The estimates from the bootstrapped series can be used to compute the standard error of the coefficients.

What I learnt from this chapter is the moving blocks bootstrap, an intuitive method of generating bootstrapped samples from time series data. One of the key aspects of time series data is the correlation structure. By chunking the data into blocks of consecutive observations and resampling whole blocks, you see to it that the correlation structure within each block is maintained, and at the same time you get resampled data to do bootstrapping. Pretty neat concept that I did not know before reading this book.
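The moving blocks idea can be sketched in a few lines of plain Python. The series, the block length of 10 and the function names below are all assumptions for illustration. Blocks are allowed to overlap and are pasted together until the resampled series has the original length.

```python
import random


def moving_blocks_resample(series, block_len, rng):
    """Build a series of the same length by pasting together randomly
    chosen overlapping blocks of consecutive observations."""
    n = len(series)
    starts = range(n - block_len + 1)  # all possible block start positions
    out = []
    while len(out) < n:
        s = rng.choice(starts)
        out.extend(series[s:s + block_len])
    return out[:n]  # trim the last block if it overshoots


def lag1_autocorr(x):
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    return num / sum((a - m) ** 2 for a in x)


rng = random.Random(3)
# Made-up AR(1) series with coefficient 0.6.
series = [0.0]
for _ in range(199):
    series.append(0.6 * series[-1] + rng.gauss(0, 1))

reps = [lag1_autocorr(moving_blocks_resample(series, 10, rng)) for _ in range(200)]
mean_rep = sum(reps) / len(reps)
se = (sum((r - mean_rep) ** 2 for r in reps) / (len(reps) - 1)) ** 0.5
print("lag-1 autocorrelation:", round(lag1_autocorr(series), 3))
print("moving blocks bootstrap se:", round(se, 3))
```

The block length is a tuning choice: longer blocks preserve more of the dependence but leave fewer distinct blocks to resample from.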

Chapter 9: Regression Models

Linear regression is one of the major workhorses in statistics. Indeed, there are a ton of assumptions that go with the conventional linear regression framework. What does bootstrapping have to say about this framework? Well, before showing the application of bootstrapping to complicated methods, the authors start off by applying it to a simple linear regression involving a dependent variable, a fixed covariate and an intercept term. Broadly, there are two ways of bootstrapping a regression

  • The first method is where you estimate the coefficients of the regression, compute the realized residuals, draw bootstrap samples from the realized residuals, recompute the dependent variable from the fitted values plus the resampled residuals, and regress the bootstrapped dependent variable on the covariate to get a fresh estimate of the parameters. Do this about 200 times and you get enough data on the parameters to compute their standard errors
  • The second method is straightforward and makes fewer assumptions. Assume your data is in the form of a matrix where each row comprises the independent and dependent variables for one observation. You sample the rows with replacement and run the regression to get the estimates. Do this about 200 times and you get enough data on the parameters to compute their standard errors

Which one is better? Well, on the face of it, the second method looks better, as it does not care whether the residuals have some dependence on the covariates. However, if you are reasonably confident that the error distribution does not depend on the covariates, or you want to treat the covariates as fixed, then the first method might be suitable.
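Both schemes can be put side by side in a short sketch. This is plain Python on made-up data with a hand-rolled least squares fit. The names fit_line, slopes_resid and slopes_pairs are hypothetical and nothing here is taken from the book's code.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1


rng = random.Random(2)
x = [i / 2 for i in range(30)]                    # made-up fixed covariate
y = [1.0 + 0.5 * a + rng.gauss(0, 1) for a in x]  # made-up response
n, B = len(x), 200

b0, b1 = fit_line(x, y)
resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Method 1: resample residuals, keep the covariate fixed.
slopes_resid = []
for _ in range(B):
    e_star = rng.choices(resid, k=n)
    y_star = [b0 + b1 * a + e for a, e in zip(x, e_star)]
    slopes_resid.append(fit_line(x, y_star)[1])

# Method 2: resample whole (x, y) rows.
rows = list(zip(x, y))
slopes_pairs = []
for _ in range(B):
    xs, ys = zip(*rng.choices(rows, k=n))
    slopes_pairs.append(fit_line(xs, ys)[1])

se_resid = statistics.stdev(slopes_resid)
se_pairs = statistics.stdev(slopes_pairs)
print("se of slope, residual bootstrap:", round(se_resid, 4))
print("se of slope, pairs bootstrap   :", round(se_pairs, 4))
```

With well-behaved errors like these the two answers should be close. They would be expected to drift apart when the error variance changes with the covariate.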

Came across this statement while browsing randomly, which I thought was elegant in its description of bootstrapping

The logic of the bootstrap procedure is that we are estimating an approximation of the true standard errors. The approximation involves replacing the true distribution of the data (unknown) with the empirical distribution of the data. This approximation may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical distribution of the data is known and in principle we may sample from it as many times as we wish. In other words, as the bootstrap sample size increases, we get a better estimate of the true value of the approximation. On the other hand, the quality of this approximation depends on the original sample size and there is nothing we can do to change it.

My biggest learning from this chapter is least median of squares. The funda is simple but the implementation is tough. Instead of minimizing the sum of squared residuals, you minimize the median of the squared residuals. While going through this material, I thought I could merely use some basic optimization technique and get the results. But I was wrong. I have learnt that there is a better and widely accepted method of computing least median of squares estimates. I have to figure out why the method works, but here is the basic structure of the method

  • Based on the number of parameters in your model including the intercept, let’s say p, select a random sample of size p without replacement, fit a line or a plane exactly through those points and compute the relevant coefficients. Do this many times, and among the candidate fits keep the one whose median of squared residuals over the full data is smallest. Bootstrapping this whole procedure then gives you enough data on the coefficients to compute their mean and standard error in the least median of squares framework. The MASS package has a function called lqs that does this and has much more functionality. I tried going over the source code of the function and found it overwhelming. Will go over the code someday at leisure. For now, I coded up a function in my own way which seems to agree with the estimates given in the book
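Here is a minimal sketch of that random-subset search for a straight line (p = 2), in plain Python on made-up data with five planted outliers. It rests on one assumption: among the candidate fits, the one with the smallest median of squared residuals is kept. It is not the lqs implementation, and the name lms_line is hypothetical.

```python
import random
import statistics


def lms_line(x, y, n_trials, rng):
    """Approximate least median of squares fit of a line (p = 2)."""
    best = None
    for _ in range(n_trials):
        i, j = rng.sample(range(len(x)), 2)  # p = 2 points, no replacement
        if x[i] == x[j]:
            continue  # vertical line, skip this candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Criterion: median of squared residuals over the full data set.
        crit = statistics.median(
            [(b - intercept - slope * a) ** 2 for a, b in zip(x, y)]
        )
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best[1], best[2]


rng = random.Random(12)
x = [float(i) for i in range(20)]
y = [2.0 + 3.0 * a + rng.gauss(0, 0.2) for a in x]  # made-up line, slope 3
for k in range(15, 20):
    y[k] -= 40.0  # five gross outliers at the right end

# Ordinary least squares slope for comparison.
mx, my = statistics.fmean(x), statistics.fmean(y)
ols_slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

lms_intercept, lms_slope = lms_line(x, y, 200, rng)
print("OLS slope:", round(ols_slope, 3))
print("LMS slope:", round(lms_slope, 3))
```

Because the median ignores the worst half of the residuals, the five outliers drag the least squares slope well away from 3 while the least median of squares fit stays near it.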

I have also learnt about the concept of breakdown of an estimator from this chapter. Concepts such as these are extremely important while computing robust statistics.

Chapter 10: Estimates of Bias

Any statistical functional that we are interested in needs to be estimated from the data, i.e. we end up using a certain function of the realized data to estimate the statistical functional. Obviously this function, called the estimator, need not be unbiased. It is often interesting and good to know the bias associated with an estimator. One point mentioned at the end of this chapter that dampens my enthusiasm for the methods in it is, “The exact use of bias is often problematic”. So, why does one need to know these methods for bias estimates? Maybe it is just nice-to-know, abstract knowledge. In any case, let me briefly summarize the contents of this chapter. It starts off with a basic definition of the bias of an estimator, and then uses the plug-in estimate and bootstrapping to estimate the bias. Here things are a little tricky. We are estimating the bias of an estimator, which is a mouthful. Sometimes stats can be wonderfully confusing. Here is something in the context of this chapter: the chapter estimates the bias of an estimator and then calculates the variation of the estimate of the bias of an estimator! I had to code it in R to understand stuff. Basically the chapter shows three procedures to estimate the bias of an estimator

  • Plug-in estimator – The regular method of bootstrapping samples
  • A better estimator of the bias of an estimator using a variant of bootstrapping
  • Jackknife estimator
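The first and third procedures can be illustrated on a statistic whose bias is known. This sketch uses plain Python, made-up data and hypothetical variable names. The plug-in variance divides by n and so is biased downward, and the jackknife bias estimate for it works out to exactly minus the unbiased sample variance divided by n.

```python
import random
import statistics

rng = random.Random(4)
data = [rng.gauss(0, 2) for _ in range(20)]  # made-up sample
n = len(data)

stat = statistics.pvariance  # plug-in variance (divides by n)
theta_hat = stat(data)

# Bootstrap estimate of bias: average replication minus the original estimate.
B = 4000
reps = [stat(rng.choices(data, k=n)) for _ in range(B)]
bias_boot = statistics.fmean(reps) - theta_hat

# Jackknife estimate of bias:
# (n - 1) * (mean of leave-one-out estimates - original estimate).
loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
bias_jack = (n - 1) * (statistics.fmean(loo) - theta_hat)

print("bootstrap bias estimate:", bias_boot)
print("jackknife bias estimate:", bias_jack)
print("-s^2 / n               :", -statistics.variance(data) / n)
```

The bootstrap number carries Monte Carlo noise, so it only lands near the jackknife value. That noise is the "variation of the estimate of the bias" the chapter worries about.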

The first time I came across the jackknife estimator was in the nonparametric stats book by Larry Wasserman. At first brush, the formula for the jackknife standard error looks as though there has been some typo: the usual sample variance term is multiplied by a term that involves n. I stumbled on this formula when my jackknife code was giving crazy results. After spending a couple of hours on the program checking for bugs, I realized that my jackknife standard error formula itself was completely wrong. Thanks to this book I understood the jackknife estimation procedure well. My takeaway from this chapter is that it is easier to estimate standard errors than bias using jackknife and bootstrapping procedures. An estimate of the bias of an estimator might itself be biased. I can’t believe I am typing this statement, which makes perfect sense now but would have left me clueless had I not read this book.

Chapter 11: The Jackknife

This chapter was easy to understand because, in the past, I had struggled a lot while incorrectly using this estimation procedure. The most important aspect of the jackknife that I had failed to notice was the inflation factor that is present in the jackknife estimates of bias and standard error. The chapter does a good job of showing the importance of the inflation factor through a nice set of illustrations. The chapter also mentions pseudo-values and says that it is still unclear whether or not pseudo-values are useful in understanding the jackknife. There are also nice examples showing that the jackknife works well on estimators that are basically linear in nature. For a correlation estimate, the jackknife procedure gives a larger standard error than the bootstrap. So, the basic advice is to use the jackknife when you are reasonably certain that the statistic of interest is close to linear. Does the jackknife always work? Using the example of the sample median, the chapter shows that the jackknife does not give a consistent estimate, as the median is not a smooth function of the data. A remedial procedure called the delete-d jackknife is mentioned towards the end of the chapter, which the authors say is still an area under research.
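A short sketch of the jackknife standard error, in plain Python with made-up data and a hypothetical function name. The inflation factor is written out explicitly. For the sample mean the result reproduces the textbook s / sqrt(n) exactly, and for the median the leave-one-out values collapse to just two distinct numbers when n is even, which is the non-smoothness problem.

```python
import math
import random
import statistics


def jackknife_se(data, stat):
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(loo)
    # Inflation factor (n - 1) / n rather than the usual 1 / (n - 1):
    # leave-one-out estimates sit much closer together than bootstrap ones.
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in loo))


rng = random.Random(5)
data = [rng.gauss(10, 3) for _ in range(20)]  # made-up sample, n = 20
n = len(data)

se_mean = jackknife_se(data, statistics.fmean)
print("jackknife se of the mean:", se_mean)
print("s / sqrt(n)             :", statistics.stdev(data) / math.sqrt(n))

# The median is not smooth: deleting one point only ever yields one of
# the two middle order statistics.
loo_medians = {statistics.median(data[:i] + data[i + 1:]) for i in range(n)}
print("distinct leave-one-out medians:", len(loo_medians))
```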

Chapter 12: Confidence intervals based on bootstrap tables

The highlight of the chapter is the bootstrap-t interval. The initial section covers the basics of confidence intervals and hypothesis testing. One of the basic methods of forming a confidence interval around an estimate is to appeal to asymptotics and the CLT and take the upper and lower cutoff points from the normal distribution. The bootstrap-t interval is an improvement on this procedure: it lets the data speak rather than presuming some distribution in the first place. The funda behind it is simple. You generate, let’s say, 1000 bootstrap samples from the data. For each bootstrap sample, you do a nested bootstrap to compute its standard error, which enables you to compute z = (bootstrap estimate - original estimate) / standard error for that sample. From this set of 1000 z values, look up the 50th and 950th ordered values to get the cutoff points for a 90% interval. Based on these cutoff points, one can form confidence intervals. The good thing about this method is that it leads to asymmetric confidence intervals and thus to better coverage probability. The section ends with an improvement on the bootstrap-t interval that relies on a transformation of the statistic.
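The nested structure is easier to see in code. Below is a minimal sketch in plain Python for the mean of a made-up skewed sample. The inner bootstrap uses only 25 replications to keep it quick, and all names and sizes are assumptions for illustration.

```python
import random
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def boot_se(xs, stat, B, rng):
    n = len(xs)
    return statistics.stdev([stat(rng.choices(xs, k=n)) for _ in range(B)])


rng = random.Random(6)
data = [rng.expovariate(1.0) for _ in range(20)]  # made-up skewed sample
n = len(data)
theta_hat = mean(data)
se_hat = boot_se(data, mean, 200, rng)

z = []
for _ in range(1000):
    xs = rng.choices(data, k=n)           # outer bootstrap sample
    se_star = boot_se(xs, mean, 25, rng)  # nested bootstrap for its se
    z.append((mean(xs) - theta_hat) / se_star)
z.sort()

# 50th and 950th ordered values: the 5% and 95% points of the z values.
t_lo, t_hi = z[49], z[949]

# Note the reversal: the upper cutoff sets the lower end of the interval.
lower = theta_hat - t_hi * se_hat
upper = theta_hat - t_lo * se_hat
print("estimate:", round(theta_hat, 3))
print("90% bootstrap-t interval:", (round(lower, 3), round(upper, 3)))
```

The two cutoffs are generally not mirror images of each other, which is where the asymmetry of the interval comes from.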

Chapter 13: Confidence intervals based on bootstrap percentiles

Let’s say you have a random sample from a standard normal distribution and you are interested in a specific parameter, say, the exponential of the mean. There are two ways to go about it. In both, the plug-in estimator gives you the point estimate of the parameter. In the first method, to form confidence intervals you need the standard error of the estimate. You bootstrap the data and compute the parameter B times. Once you have these B estimates, you can compute the standard error and then the normal-based interval, i.e. estimate + c(-1, 1) * 1.96 * standard error. The second method is where you work directly with the bootstrap replications and take their 2.5th and 97.5th percentiles. Which method is better? This chapter says that the first method is convenient but not always reliable. It says that the second method is much better, as it takes into account that the underlying distribution of the estimate might not be normal.

Implicitly, the second method based on bootstrap percentiles knows the right transformation that makes the resultant data normal. I did not understand this part until I wrote some code in R. The code clearly showed that percentile methods work better than normal-based methods, and that using percentile methods does not require us to know the transformation function. A couple of nice features of percentile-based methods are that 1) they are range preserving and 2) they are transformation respecting. For all those people who dread remembering formulae for confidence intervals, this chapter is a big relief, as bootstrapping gives you good estimates of the parameter’s cutoff points for a given alpha. One must not mistake this method for nirvana, though, as it does not correct the bias of your estimator. I worked out an exercise problem from this chapter and found that, out of the normal-based interval, the percentile-based interval and the bootstrap-t interval introduced in Chapter 12, the bootstrap-t interval had a better coverage probability than the other two. Also came across a blog post that seems to agree with what I say here.
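The exp(mean) example and the two properties can be checked with a few lines of plain Python (made-up data, hypothetical variable names). The same resamples are used for the mean and for exp(mean), so the percentile interval for one is exactly the transform of the interval for the other.

```python
import math
import random
import statistics


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(7)
data = [rng.gauss(0, 1) for _ in range(10)]  # made-up standard normal sample
n, B = len(data), 2000

boot_means = sorted(mean(rng.choices(data, k=n)) for _ in range(B))
boot_theta = [math.exp(m) for m in boot_means]  # still sorted: exp is monotone
theta_hat = math.exp(mean(data))

# Method 1: normal-based interval from the bootstrap standard error.
se = statistics.stdev(boot_theta)
normal_ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# Method 2: percentile interval, the 50th and 1950th of 2000 ordered values.
pct_ci = (boot_theta[49], boot_theta[1949])

# Transformation respecting: same as exponentiating the interval for the mean.
mean_ci = (boot_means[49], boot_means[1949])

print("normal-based interval:", normal_ci)
print("percentile interval  :", pct_ci)
print("exp of mean interval :", (math.exp(mean_ci[0]), math.exp(mean_ci[1])))
```

The percentile interval can never dip below zero for a positive quantity like exp(mean), whereas nothing stops the lower end of the normal-based interval from doing so when the standard error is large.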

Chapter 14: Better bootstrap confidence intervals

The problem with the bootstrap-t method is that it is erratic despite its better coverage. The percentile interval approach is not erratic, but it does not have good coverage. To get over these problems, the chapter introduces two different methods: the bias-corrected and accelerated (BCa) method and approximate bootstrap confidence (ABC) intervals. Source code for both is present in the R package associated with this book. At this point in reading the book, I realized that I should move ahead instead of working out exactly what these functions do. I understand the functions at a 1000 ft view, but I do not want to do a PhD in bootstrapping. I will use these methods in the future whenever I need to come up with confidence intervals.

Chapter 15: Permutation Tests

The permutation test was introduced by Fisher in the 1930s as a theoretical argument supporting the t statistic. When the sample size is small, one can compute the exact test, i.e. enumerate every way of partitioning the pooled data into the two groups, compute the statistic for each partition, and draw the inference from that set of values. However, if the number of data points is large, exact permutation tests become computationally expensive. In such cases, Monte Carlo techniques are used in combination with permutation tests to test various hypotheses. To find the combinations, I used the combinations function in the gtools package. For reasons that I can’t understand, I spent quite some time implementing the permutations algo, which in hindsight looks so simple. In any case, the technique is a valuable addition to any data analyst’s toolbox.
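Both flavours, exact enumeration and the Monte Carlo version, fit in a short plain-Python sketch. The two groups below are made-up numbers, itertools.combinations plays the role that the gtools function played in R, and the helper name diff_in_means is hypothetical.

```python
import itertools
import math
import random

a = [12.1, 13.4, 11.8, 14.0, 12.9]  # made-up first group
b = [9.5, 10.2, 8.7, 10.9, 9.9]     # made-up second group
pooled = a + b
n_a = len(a)
total = sum(pooled)


def diff_in_means(idx_a):
    """Mean of the points labelled as group a minus the mean of the rest."""
    s = sum(pooled[i] for i in idx_a)
    return s / n_a - (total - s) / (len(pooled) - n_a)


observed = diff_in_means(range(n_a))

# Exact test: every way of choosing which 5 of the 10 points form group a.
diffs = [diff_in_means(c)
         for c in itertools.combinations(range(len(pooled)), n_a)]
p_exact = sum(d >= observed for d in diffs) / len(diffs)

# Monte Carlo version: random relabellings instead of full enumeration.
rng = random.Random(8)
n_perm = 2000
hits = 0
for _ in range(n_perm):
    idx = sorted(rng.sample(range(len(pooled)), n_a))
    hits += diff_in_means(idx) >= observed
p_mc = hits / n_perm

print("number of partitions:", len(diffs), "=", math.comb(len(pooled), n_a))
print("exact one-sided p-value      :", p_exact)
print("Monte Carlo one-sided p-value:", p_mc)
```

With 5 + 5 points the enumeration is trivial, but the number of partitions grows combinatorially with the sample size, which is why the Monte Carlo variant takes over quickly.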

Chapter 16: Hypothesis testing with bootstrap

In any kind of hypothesis testing, one needs a test statistic as well as the distribution of the data under the null hypothesis. The test statistic need not be an estimate of a parameter. It can be any general function of the data. As far as the distribution is concerned, you can do a plug-in style of distribution estimation. The chapter works on a sample data set and shows how standard hypothesis testing can be replaced by bootstrap-based hypothesis testing. One big difference between the permutation test and the bootstrap-based test is that in the former case, one can only test the equality of distributions between two samples, whereas the bootstrap can be used to test means, medians, or any other test statistic comparison between two samples.

The basic framework for one-sample testing is as follows (given data points x1, x2, ..., xn):

  • Translate the data: Create a new sample by subtracting the realized sample mean from every point and then adding back the specific value mentioned in the null hypo. This is the key step. You do not use the sample empirical distribution as it is. You translate it so that it obeys the null hypo and then use it for bootstrapping
  • Generate bootstrap data sets from this new translated data
  • Calculate the t value for each bootstrap data set and count how many are greater than the observed t value. Based on the percentage of t values that are greater than the observed t value, you can reject or fail to reject the null hypo
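The three steps above can be sketched as follows, in plain Python with a made-up sample. The null value mu0 and the function name t_stat are assumptions for illustration.

```python
import math
import random
import statistics


def t_stat(xs, mu0):
    return (statistics.fmean(xs) - mu0) / (statistics.stdev(xs) / math.sqrt(len(xs)))


rng = random.Random(9)
data = [rng.gauss(1.5, 1.0) for _ in range(25)]  # made-up sample
mu0 = 0.0                                        # mean under the null hypo
t_obs = t_stat(data, mu0)

# Step 1: translate the data so that its mean equals the null value.
xbar = statistics.fmean(data)
translated = [x - xbar + mu0 for x in data]

# Steps 2 and 3: bootstrap the translated data and compare t values.
B = 1000
t_star = [t_stat(rng.choices(translated, k=len(data)), mu0) for _ in range(B)]
asl = sum(t >= t_obs for t in t_star) / B  # achieved significance level

print("observed t:", round(t_obs, 2))
print("bootstrap achieved significance level:", asl)
```

Resampling the untranslated data instead would centre the bootstrap t values around the observed one, and the comparison would say nothing about the null.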

The most interesting example of the chapter comes at the end, where the null hypo concerns the number of modes of a population’s density. For this kind of hypo, you cannot use the standard tests, as there is no closed-form textbook solution readily available. The chapter shows the power of bootstrapping by testing the multimodality hypothesis on a toy data set. The biggest takeaway from this chapter is that one should not use the sample empirical distribution as it is, but translate it to conform to the null before carrying out hypothesis testing.

Chapter 17 : Cross-validation and other estimates of prediction error

After slogging through 230-odd pages of the book, I finally got to read about the technique that is used for density estimation in Larry Wasserman’s book. I had found a reference to the cross-validation estimate of error in a basic example there, the histogram estimator, one of the simplest density estimators, where the technique is used to choose the bin width. Wasserman’s books on parametric and nonparametric statistics use it while explaining the concepts. I could not proceed there with only a surface-level knowledge of this estimate, and hence decided to dive deep into this classic text on bootstrapping. Has my understanding of cross-validation improved after working through 230-odd pages of the book?

I think the concept is easier to understand when you are in a flow, i.e. starting from the very basics of bootstrapping, the book explains various measures of statistical accuracy: standard errors, biases, and confidence intervals. All of these are measures of accuracy for the parameters of a model. The logical next step is to measure the accuracy of the entire model.

This is where cross-validation comes into the picture, and it fits in the flow. The concept is easy to visualize. It is computationally demanding, but thanks to the computing speed available it is within the reach of the everyday analyst. In fact, cross-validation predates the bootstrap, which Efron introduced in 1979. The biggest problem with all the prediction statistics reported in the standard statistical output summary is that they capture “apparent error”, i.e. you build a model on specific data, you test the model with the same data and report the error estimate. Obviously it is an optimistic estimate. Consider all the usual statistics that are reported in the name of correcting this “apparent error”, i.e. correction for the number of degrees of freedom, AIC, BIC, etc. All these prediction error estimates inherently depend on a few things

  • The number of parameters in the model must be known
  • An estimate of residual variance is needed

In most complicated models, it is difficult to compute the above stuff without making assumptions. In the end, as a practitioner, you don’t want to use these closed-form textbook prediction error estimates. This is where the method of cross-validation is extremely appealing. When you look at the gamut of estimates that are usually reported and then look at the appealing features of cross-validation, you immediately get the point. I think this kind of understanding would have been difficult had I merely read this procedure and moved on to other parts of nonparametric estimation in Wasserman’s books. I am happy that I could work through 17 chapters of this book and have understood things in a better way. Well, not all is great about the cross-validation estimate. Even though it is nearly unbiased, it has large variability. So if you want to cut down on variability, you can use one of the methods mentioned towards the end of the chapter, i.e. the simple bootstrap method. The basic funda of the method is to get a quantitative estimate of the optimism that comes with using “apparent error” and correct for that optimism.
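To see apparent error, cross-validation and the optimism correction next to each other, here is a minimal plain-Python sketch on made-up regression data. The helper names fit_line and mse are hypothetical, and the optimism step follows the simple recipe described above: the error of a bootstrap-fitted model on the original data minus its error on its own bootstrap sample, averaged over samples.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1


def mse(x, y, b0, b1):
    return sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y)) / len(x)


rng = random.Random(10)
x = [i / 3 for i in range(30)]                    # made-up covariate
y = [1.0 + 0.5 * a + rng.gauss(0, 1) for a in x]  # made-up response
n = len(x)

# Apparent error: fit and evaluate on the same data.
b0, b1 = fit_line(x, y)
apparent = mse(x, y, b0, b1)

# Leave-one-out cross-validation: predict each point from the other n - 1.
sq_errors = []
for i in range(n):
    c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    sq_errors.append((y[i] - c0 - c1 * x[i]) ** 2)
cv_error = sum(sq_errors) / n

# Bootstrap estimate of optimism.
B = 200
total = 0.0
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]
    xs, ys = [x[i] for i in idx], [y[i] for i in idx]
    d0, d1 = fit_line(xs, ys)
    total += mse(x, y, d0, d1) - mse(xs, ys, d0, d1)
optimism = total / B

print("apparent error        :", round(apparent, 3))
print("leave-one-out CV error:", round(cv_error, 3))
print("apparent + optimism   :", round(apparent + optimism, 3))
```

Neither correction needs the number of parameters or a residual variance estimate as input, which is the appeal over the closed-form adjustments.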

Chapter 18: Adaptive Estimation and Calibration

There are two examples in the chapter that explain the concepts of adaptive estimation and calibration. Firstly, what is adaptive about estimation? Well, think about a histogram estimator where the bin width controls the bias and the variance. A procedure that estimates the histogram while choosing the bin width from the data is an example of adaptive estimation. The book uses an example from nonparametric regression, a cubic smoothing spline estimate, to show that the procedure it follows is an adaptive estimation procedure. An error function is evaluated for various levels of the smoothing parameter, and the level that minimizes it is chosen. So adaptive estimation is basically an optimization + estimation problem. In the case of calibration, the book uses the example of estimating the lower cutoff point of a confidence interval to illustrate the calibration algorithm. Personally, the calibration algo is nice-to-know stuff for me. However, adaptive estimation is something I will be using often. In fact, most R packages for nonparametric regression have some sort of adaptive estimation procedure built into them.
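The "optimization + estimation" idea can be sketched in a few lines of Python. Note the swap: instead of the cubic smoothing spline used in the book, this sketch uses a Gaussian kernel smoother, and it picks the bandwidth by leave-one-out cross-validation. The data, the candidate bandwidths and the function names are assumptions made for illustration.

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel."""
    diffs = (x_eval[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return (weights @ y_train) / weights.sum(axis=1)

def loo_cv_error(x, y, bandwidth):
    """Leave-one-out cross-validated mean squared error for one bandwidth."""
    diffs = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    np.fill_diagonal(weights, 0.0)  # each point is predicted without itself
    pred = (weights @ y) / weights.sum(axis=1)
    return float(np.mean((y - pred) ** 2))

def choose_bandwidth(x, y, candidates):
    """Return the candidate with the smallest cross-validated error."""
    errors = [loo_cv_error(x, y, h) for h in candidates]
    return candidates[int(np.argmin(errors))], errors

# Hypothetical data and candidate grid (assumptions for illustration)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)
candidates = [0.02, 0.05, 0.1, 0.2, 0.5]

best_h, errors = choose_bandwidth(x, y, candidates)
fitted = kernel_smooth(x, y, x, best_h)
print("chosen bandwidth:", best_h)
```

The estimation step is the smoother itself. The adaptive step is the outer search over bandwidths, driven by an error estimate computed from the same data.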

Chapter 19: Assessing the error in bootstrap estimates

Like any estimate, a bootstrap estimate has an inherent error. The variance of a bootstrap estimate comes from two sources: sampling variability, due to the fact that we have only a sample of size n rather than the entire population, and bootstrap resampling variability, due to the fact that we take only B bootstrap samples rather than an infinite number. For calculating the standard error of an estimate, you need about 200 bootstrap samples; for quantile estimation, you need about 500-1000 bootstrap samples. So, how does one go about making these decisions? This chapter discusses the connection between the coefficient of variation of the ideal bootstrap estimate and the coefficient of variation of the estimate based on B bootstrap samples. The relation depends on the standardized kurtosis of the bootstrap distribution of the estimate. My biggest takeaway from this chapter is the “jackknife-after-bootstrap” method. Let’s say you have an estimate and you use the bootstrap to calculate its standard error. Now you might want to calculate the standard error of the standard error of the estimate! Basically, one wants to know the variability of this bootstrap estimate. Jackknife-after-bootstrap is a method where you apply the jackknife to the bootstrap samples that have already been generated, so no fresh resampling is needed. This is a very useful method if you want to figure out the variance of a bootstrapped quantity. It need not be a standard error; it is applicable to any bootstrap estimate, and that’s the utility of this method.
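Here is a rough Python sketch of jackknife-after-bootstrap for the standard error of a bootstrap standard error. The idea, as I understand it, is that the bootstrap samples which happen to leave out point i can stand in for a bootstrap run on the data with point i deleted. The data, the choice of statistic and the function name are assumptions for illustration, and with a modest B the answer tends to be inflated by Monte Carlo noise.

```python
import numpy as np

def jackknife_after_bootstrap(data, statistic, n_boot=1000, seed=0):
    """Return (bootstrap se, jackknife-after-bootstrap se of that se)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Each row holds the indices of one bootstrap sample
    idx = rng.integers(0, n, size=(n_boot, n))
    replicates = np.array([statistic(data[row]) for row in idx])
    se_boot = float(replicates.std(ddof=1))

    # For each i, recompute the se using only samples that do not contain point i
    se_leave_out = np.empty(n)
    for i in range(n):
        without_i = ~(idx == i).any(axis=1)
        se_leave_out[i] = replicates[without_i].std(ddof=1)

    # Usual jackknife formula applied to the leave-one-out se values
    spread = np.sum((se_leave_out - se_leave_out.mean()) ** 2)
    se_of_se = float(np.sqrt((n - 1) / n * spread))
    return se_boot, se_of_se

# Hypothetical sample (an assumption for illustration)
rng = np.random.default_rng(7)
data = rng.normal(loc=10.0, scale=2.0, size=30)

se_boot, se_of_se = jackknife_after_bootstrap(data, np.mean)
print("bootstrap se of the mean:", round(se_boot, 4))
print("se of that se:", round(se_of_se, 4))
```

Any other statistic can be passed in place of `np.mean`, and the same trick works for bootstrap quantities other than the standard error.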

Chapter 20: A geometric representation for the bootstrap and jackknife

This chapter gives a visual that I guess is the most appealing thing I have read in the entire book. Bootstrapping, i.e. sampling with replacement, can also be represented in terms of probability vectors. The realized data corresponds to the vector that gives each data point the same probability 1/n. In the case of n=3, the centroid of the simplex represents the realized data. In the following figure, the empty circles represent the jackknife samples and the solid circles represent the bootstrap samples.


So, across the various realizations of the probability vector, the statistic traces out a surface over this simplex. The following visual captures the funda behind the bootstrap and the jackknife.


This kind of visual representation of procedures is priceless. The jackknife estimate of standard error amounts to using a linear approximation of the surface.


As you can see, the jackknife defines a hyperplane that approximates the surface of the statistical functional. There is a nice visual provided for the infinitesimal jackknife estimator also.


In this case, the approximating hyperplane is tangent to the original surface. The basic takeaway from this chapter is the visual representation of the techniques and its implication for the standard errors calculated from each of the methods. The chapter concludes by saying that the ordinary jackknife will tend to give larger estimates of standard error than the infinitesimal or positive jackknife.
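Since the figures are all about n=3, the geometry can be reproduced with a small Python sketch. It lists every bootstrap resampling vector on the simplex with its multinomial probability, lists the three jackknife points, and evaluates a statistic written as a function of the weight vector. The three data values and the weighted-variance statistic are my own assumptions, not the book's example.

```python
import itertools
import math
import numpy as np

def weighted_statistic(p, x):
    """Example statistic as a function of the weight vector p: a weighted variance."""
    mean = np.dot(p, x)
    return float(np.dot(p, (x - mean) ** 2))

def bootstrap_points(n):
    """All distinct bootstrap resampling vectors with their multinomial probabilities."""
    points = []
    for counts in itertools.product(range(n + 1), repeat=n):
        if sum(counts) == n:
            ways = math.factorial(n) / math.prod(math.factorial(c) for c in counts)
            points.append((np.array(counts) / n, ways / n ** n))
    return points

def jackknife_points(n):
    """Weight vectors that drop one point and spread the mass over the rest."""
    points = []
    for i in range(n):
        p = np.full(n, 1.0 / (n - 1))
        p[i] = 0.0
        points.append(p)
    return points

# Hypothetical data with n = 3 (an assumption for illustration)
x = np.array([1.0, 2.0, 4.0])
n = len(x)

# Ideal bootstrap standard error: exact, by enumerating every point on the simplex
boot = bootstrap_points(n)
values = np.array([weighted_statistic(p, x) for p, _ in boot])
probs = np.array([prob for _, prob in boot])
ideal_boot_se = float(np.sqrt(np.sum(probs * (values - np.sum(probs * values)) ** 2)))

# Jackknife standard error: uses only the three jackknife points
jack_values = np.array([weighted_statistic(p, x) for p in jackknife_points(n)])
jack_se = float(np.sqrt((n - 1) / n * np.sum((jack_values - jack_values.mean()) ** 2)))

print("number of bootstrap points:", len(boot))
print("ideal bootstrap se:", round(ideal_boot_se, 4))
print("jackknife se:", round(jack_se, 4))
```

The bootstrap uses the height of the surface at every solid circle, weighted by its probability, while the jackknife looks only at the three empty circles and extrapolates linearly from them.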

Chapter 21: An overview of nonparametric and parametric inference

The previous chapter talked about the relationship between jackknife and its variants and bootstrap. This chapter talks about the relationship between such procedures and parametric methods like MLE etc. A nice visual at the very beginning summarizes the various inference procedures that are discussed in this chapter.


The influence function is introduced to show the connection between the elements in the top row, i.e. the nonparametric exact bootstrap and the nonparametric approximate methods. All the approximate methods finally boil down to a choice of influence function.

I learnt something that I had never known till date, i.e. using nonparametric methods for maximum likelihood estimation. I was thrilled to find that such a procedure exists. I always liked MLE but was a little put off by the fact that I need to make assumptions. This nonparametric MLE sounds almost too good, and I need to explore it in the days to come. I checked out the book by Yudi Pawitan that deals with this concept at length. I will go over the book someday.

The bottom left block deals with the parametric bootstrap, which is the exact method in the parametric row. The funda behind it is: sample from the assumed distribution with its parameters set to their estimates, compute the statistical functional for each of B samples, and then take the standard deviation of those B values as the standard error. The block on the bottom right is typically introduced in most stats101 courses: the usual MLE, with the variance of the estimate calculated using Fisher information. The relation between parametric and nonparametric inference is shown in a multivariate case where the score function and the influence function have a close connection. The Fisher information method for estimating variance can be thought of as an influence function based estimate that uses a specific model for the influence function. The chapter also explores the relationship between the delta method and the infinitesimal jackknife method.
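The two bottom blocks can be compared side by side on a toy problem. The Python sketch below assumes exponentially distributed data (my choice, not the book's example) and estimates the standard error of the MLE of the rate in two ways: with the parametric bootstrap, and with the Fisher information formula. All names are hypothetical.

```python
import numpy as np

def exponential_mle(sample):
    """MLE of the rate of an exponential distribution."""
    return 1.0 / np.mean(sample)

def fisher_se(sample):
    """Total information is n / rate**2, so the se is rate / sqrt(n)."""
    return float(exponential_mle(sample) / np.sqrt(len(sample)))

def parametric_bootstrap_se(sample, n_boot=2000, seed=0):
    """Simulate from the fitted exponential and take the sd of the re-estimated rates."""
    rng = np.random.default_rng(seed)
    rate_hat = exponential_mle(sample)
    sims = rng.exponential(scale=1.0 / rate_hat, size=(n_boot, len(sample)))
    replicates = 1.0 / sims.mean(axis=1)
    return float(replicates.std(ddof=1))

# Hypothetical sample with true rate 0.5 (an assumption for illustration)
rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=50)

print("rate estimate:", round(exponential_mle(sample), 4))
print("Fisher information se:", round(fisher_se(sample), 4))
print("parametric bootstrap se:", round(parametric_bootstrap_se(sample), 4))
```

With a sample of this size the two numbers should be close. The difference is that the Fisher route needs a formula derived for this particular model, while the bootstrap route needs only the ability to simulate from it.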

There are 5 other chapters that I have skipped in my first read of the book as I felt that I was overwhelmed by so many concepts relating to bootstrap that I had to take a break. I will probably reread this book at a later point in time and try to summarize the remaining 5 chapters.

Takeaway:

The book is a classic reference on the bootstrap. You will be hard-pressed to find any other book that tackles all the aspects of the bootstrap comprehensively and in one place. It is a must-read for any number cruncher, as it is extremely relevant in this age of big data. Having a nonparametric frame of mind is becoming essential for doing good analysis.


Statistics stands on two pillars, estimation and inference. In pretty much anything you work on in stats, you end up either estimating something or inferring something. If you take a random sample of people who have taken a stats101 course at some point in their lives and ask them what the course was all about, the most likely answer would be, "It was something to do with p-values". Statistics at its core is about comparing a set of numbers with each other, with theoretical models and with past experience. But most introductory textbooks contain scary formulae and distribution tables that students are expected to use. I think if you are a teacher introducing statistics to a new batch of students, it will do a world of good if you dramatize a specific act: walk into the class and tear out all the pages in the appendix that have these tables and arcane formulae, which only scare people away from developing a statistical mindset. It will at least drive home the point that there is no formal textbook for interpreting real-life data. Why do you think textbooks make assumptions about the distribution of the data? Pause for a few seconds and think about it. Well, one of the main reasons is that unless you make some assumptions, you cannot fill up the textbook with neat formulae. Yes, think about it. Unless you assume a certain distribution, you cannot write down a neat formula for an estimate. You cannot write down a neat formula for a confidence interval, and so on and so forth. What’s the use of those formulae? Not much.

As an example, let’s say you record your evening commute times from office to home, daily, for a month or two. You want to see whether your average commute time on Monday is less than on Friday. What would you do? Well, you can calculate the mean of the commute times on both days. But what you are doing there is estimating the average time. What you intended was inference, i.e. whether your commute time on Monday is less than Friday’s. If you open up a stats book, it will tell you to use some formula that involves the means of the commute times, the pooled standard deviation etc. From that you form a t statistic and check whether the p-value for the statistic is less than 0.05. The whole procedure mentioned in the book is brimming with a ton of assumptions. Why should one assume the same variance across the two samples? Ok, if you don’t assume that, there is another formula where you can make that adjustment. Either use this formula or that formula..argh.. There is a fundamental problem with this approach: the use of parametric statistics to determine the critical value at which the observed t becomes significant. In these days of abundant and cheap computing power, there is little reason to follow such a textbook approach anymore. One can use nonparametric, distribution-free methods even for a test as simple as the one mentioned above. This was unthinkable a few decades ago, when computing power was expensive and software was expensive. For the above example, you can generate, let’s say, 500 resampled data sets and get a practical answer to the question. In today’s world of free open source software, it would be unthinkable for someone to open a textbook to answer the question mentioned above.
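Here is what the resampling answer to the commute question could look like in Python. This is a sketch using a permutation test, which is one of several resampling choices. The commute times are made-up numbers and the function name is hypothetical.

```python
import numpy as np

def permutation_p_value(monday, friday, n_resamples=500, seed=0):
    """One-sided permutation test: is the Monday commute shorter on average?"""
    rng = np.random.default_rng(seed)
    observed = np.mean(friday) - np.mean(monday)
    pooled = np.concatenate([monday, friday])
    n_monday = len(monday)
    count = 0
    for _ in range(n_resamples):
        # Under the null the day labels are exchangeable, so shuffle them
        shuffled = rng.permutation(pooled)
        diff = np.mean(shuffled[n_monday:]) - np.mean(shuffled[:n_monday])
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly zero
    return float(observed), (count + 1) / (n_resamples + 1)

# Made-up commute times in minutes (an assumption for illustration)
monday = np.array([38, 42, 35, 44, 37, 41, 39, 36], dtype=float)
friday = np.array([45, 40, 43, 48, 52, 44, 47, 49], dtype=float)

observed, p_value = permutation_p_value(monday, friday)
print("Friday minus Monday, in minutes:", observed)
print("one-sided p-value:", round(p_value, 4))
```

No t tables, no pooled standard deviation and no equal-variance assumption are involved. The only assumption is that, if the day did not matter, the labels Monday and Friday could be shuffled freely.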

Books such as these are helpful to the general public, who don’t run statistical analyses in their day-to-day work but have to interpret statistical results in their professional or personal life. The book does not have a single formula, yet it tries to impart more knowledge than most of the boring and sleep-inducing textbooks that one comes across. The book tries to weave together some important stats101 concepts by telling 34 stories, stories that one can easily read while remembering the lesson associated with each. Stories are always a great way to teach/learn/understand stuff. Let me attempt to list the concepts that these 34 stories cover:

  1. Basic difference between estimation and inference
  2. In some situations the mean is better than the median, while in others it is vice versa
  3. For skewed data, median and interquartile range summaries give a better understanding of the data than mean and standard deviation
  4. What is skewness and how do you identify it in the data?
  5. Mean and standard deviation tell you everything about the data (if it comes from a normal distribution). Technically they are called sufficient statistics. Histograms do the job where mean and median are not enough.
  6. Story that hints at the utility of nonparametric regression
  7. Normal distribution and its parameters
  8. If data doesn’t look normal, you can take a log transform and work with it. Why does a log transform make the data normal in many situations? Verbalizing it without any formulae is where the stories do a good job.
  9. How does one check whether data is from a normal distribution?
  10. What does the standard error of an estimate mean? Well, modern nonparametric statistics actually has algos to figure out the standard error of the standard error of an estimate. That’s a mouthful, but there are situations where it matters
  11. What do you understand by a confidence interval ?
  12. What is a statistical tie ?
  13. How does one verbalize p-value ?
  14. Story of a dry toothbrush to illustrate the basic funda of p-value
  15. What does one mean by Null Hypothesis ?
  16. Difference between a t test and Wilcoxon test. What are parametric statistics and non parametric statistics ?
  17. Concepts such as sample size, precision , statistical power, Type I and Type II errors
  18. Story to convey the usage of univariate regression, multivariate regression and logistic regression.
  19. Multiple regression is not a magic wand that you can use to churn out models. It has a ton of assumptions and the more you are aware of them, the less crap you dish out/ less crap you take from the news/articles/papers that have a statistical garb.
  20. When a child coughs, the mother overreacts and the father underreacts. What are the consequences? A story that illustrates the concepts of sensitivity and specificity of a test, i.e. the probability that the test shows positive/negative given that the patient has disease/no-disease. If a doctor has to discuss a test result with a patient, sensitivity and specificity are not of much help. One is supposed to talk about the probability that the patient has disease/no-disease given that the test shows positive/negative, called the positive predictive value and the negative predictive value.
  21. How does a decision tree work ?
  22. A story that analyzes the blunders one sees in academic papers that report a barrage of p-values for everything
  23. Weeding out unnecessary p-values: a paper that involves testing 150-odd patients and ends up reporting 126 separate p-values, i.e. almost one p-value per patient! Heights of lunacy
  24. A story that drives home the point that chi-square and ANOVA provide inference and not estimation. Similarly, correlation provides estimation and not inference. Any statistical test that provides p-values is basically an inferential procedure and one cannot use it for estimation
  25. Words of caution in the context of regression
  26. Explaining the idea of "Regression to the mean" using a simple story
  27. Misapplication of conditional probability in O.J. Simpson’s case and Sally Clark’s case
  28. Dangers of multiple hypothesis testing
  29. A story that reiterates that p-values are about inference and not about estimation
  30. A story that talks about statistical errors creeping in because of not "starting the clock" simultaneously for test and control group
  31. There is no single right way of doing statistics. It is all problem, purpose and context dependent
  32. Importance of reproducibility in statistics. This is a topic that is dear to my heart. Maybe 5-10 years back there was no good infra to do it. But now, thanks to the amazing efforts of the R community and the Python community, creating reproducible research has become easy.
  33. Statistics has to link math to whatever field you are working in. This linkage is what makes working in stats such a fun thing. Here is a wonderful illustration that is spot on about the real world and the statistical world. Universities, schools and textbooks focus on the right side of the picture. They teach you all the math, probability and models to equip you enough, so that you can go into the real world and make all the connections.
  34. Statistics is about people. Even though it involves working with data, the results that you obtain ultimately affect people. Quoted here is the story of John Snow and the statistical work through which the world got a handle on the cholera epidemic.
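The point of story 20, that sensitivity and specificity are not what a patient needs to hear, is easy to check with a few lines of Python. This is a small sketch of Bayes' rule; the 90% sensitivity, 90% specificity and 1% prevalence are made-up numbers, not figures from the book.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Convert test characteristics into positive and negative predictive values."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(disease | test positive)
    npv = true_neg / (true_neg + false_neg)   # P(no disease | test negative)
    return ppv, npv

# Made-up test characteristics (an assumption for illustration)
ppv, npv = predictive_values(sensitivity=0.9, specificity=0.9, prevalence=0.01)
print("positive predictive value:", round(ppv, 4))
print("negative predictive value:", round(npv, 4))
```

With these numbers the test looks good on paper, yet only about 1 in 12 people who test positive actually has the disease, because the disease is rare to begin with.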


Takeaway:

The p-value is the probability that the data would be at least as extreme as those observed if the null hypothesis were true. The author takes the reader through a set of 34 stories to explain various concepts around the p-value without using any math. Somewhere in the book he says,

Good Statistics is a bit like a pair of high quality stereo speakers. It allows you to hear the data clearly with out distortion. Yet the best speakers in the world isn’t going to make a CD sound good if the music was badly played or recorded.

Well, real-world data is messy and chaotic. Statistics, via its high-quality stereo speakers, voices out all the screeches, scratches, crackles and the occasional melody. This helps in reducing Type I and Type II errors. Books such as these help in clearing away the unnecessary math / formula garb that one encounters in various stats101 books.