July 31, 2012
Posted by safeisrisky under Books
The book begins with a single mother murdering her ex-husband. Her neighbor, a quiet high school math teacher, offers to help her create the perfect alibi and lead every police investigation into a dead end. The math teacher almost succeeds, and nobody figures out the real story behind the murder except a physics professor whom the detectives turn to for help.
The most engrossing thing about this book is that you are given the murder, i.e. the complete picture, at the very beginning, and then you encounter various characters who try to fit together the pieces of the jigsaw puzzle. So you follow the intuitions, deductions and inferences of the various characters about the murder, and ultimately you can’t help feeling lost too. You start wondering whether it is possible to implicate the murderer at all. But then the physics professor cracks it and puts it all together. The final picture that emerges from the jigsaw puzzle is totally unexpected.
Thoroughly enjoyed the book.
July 29, 2012
I stumbled onto this book way back in September 2010 and had been intending to work on the frequency domain aspect of time series since then. I am embarrassed to admit that almost 2 years have passed and it was lying in my inventory crying to be picked up. Why did I put off reading this book for so long? Maybe I am not managing my time properly. In any case, I finally decided to go over this book. In recent weeks I have spent some time understanding Fourier transforms (continuous and discrete). I was thrilled to see so many connections between the Fourier transform and statistics. It was like getting a set of new eyes to view many aspects of statistics from a Fourier transform angle. The Central Limit Theorem appears “wonderful” from a convolution standpoint. Density estimation becomes so much more beautiful once you understand the way the FFT is used to compute the kernel density, etc. The simple looking density function in any of the statistical software packages has one of the most powerful algorithms behind it, the “Fast Fourier Transform”. I can go on and on listing many such aha moments. But let me stop and focus on summarizing this book. With Fourier math fresh in my mind, the book turned out to be easy sailing. In this post, I will attempt to summarize the main chapters of the book.
Chapter 1: Research Questions for Time-Series and Spectral Analysis Studies
The chapter starts off by giving the big picture of variance decomposition of a time series. In a time series, one first tries to estimate the variance explained by the trend component. From the residuals, one then tries to estimate the variance explained by the cyclical component. Besides trend and cyclicality, there may be other components in the time series that can be culled out by studying the residuals. With this big picture, the chapter then talks about using a sinusoid model for cycles. Thanks to Joseph Fourier and a century of mathematicians, physicists and researchers who have worked on Fourier series and transforms, we now know that Fourier series and transforms can be used to represent not only functions in L2 space but a much wider class of functions. The chapter ends with the author saying that the focus of her book is on the cyclical component of the data, a component that some econometricians remove before analyzing data. It is common to see time series analysis examples decompose the series into trend, cyclical, seasonal and idiosyncratic components and then work solely with the idiosyncratic component. This book, however, is geared towards the analysis of the cyclical and seasonal components after removing the trend component from the time series. Thus the first chapter makes the focus of the book precise and sets the tone for the things to come.
Chapter 2: Issues in Time-Series Research Design, Data Collection, and Data Entry: Getting Started
The principles that this chapter talks about are relevant to any research design study. For a researcher who is interested in doing spectrum analysis, the following are the three aspects that need to be kept in mind :
Length of the time series
Choice of sampling frequency and
Should the sampling be time based or event based.
Out of these three aspects, the choice of sampling frequency is a subtle issue. One can intuitively understand the fact that a high-frequency wave can masquerade as a low-frequency wave, but a simulation always helps in understanding things better.
The above visual shows two signals, the red with a higher frequency and the blue with a lower frequency. If one undersamples the red signal, it is possible that the analysis throws up the frequency associated with the blue signal. This simple example shows that the sampling choice is a critical component for avoiding such errors.
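The masquerading effect can be checked numerically. The following is a minimal sketch in plain Python (not the R used later in this post); the frequencies 0.9 and 0.1 and the sampling interval are made-up values chosen for illustration. With one observation per time unit, the highest frequency that can be resolved is 0.5 cycles per unit, so a 0.9 cycle wave produces exactly the same samples as a (sign-flipped) 0.1 cycle wave.

```python
import math

def sample(freq, n_samples, dt):
    """Sample sin(2*pi*freq*t) at times 0, dt, 2*dt, ..."""
    return [math.sin(2 * math.pi * freq * i * dt) for i in range(n_samples)]

dt = 1.0                      # one observation per time unit (illustrative choice)
high = sample(0.9, 20, dt)    # fast signal: 0.9 cycles per time unit
low = sample(-0.1, 20, dt)    # slow signal: 0.1 cycles per time unit, sign-flipped

# Largest difference between the two sets of samples; it is zero up to rounding,
# so the sampled data cannot tell the two signals apart.
max_gap = max(abs(h - l) for h, l in zip(high, low))
print(max_gap)
```

Sampling the same fast signal with a much smaller `dt` would separate the two again, which is the practical argument for choosing the sampling frequency carefully.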
Chapter 3: Preliminary Examination of Time-Series Data
This chapter talks about techniques to identify trend and check stationarity in the time series. One important takeaway of this chapter is: if you are planning to do any spectrum analysis on time series data, it is better to remove the trend component. If you do not remove it, the trend produces a broad peak at the low-frequency end of the spectrum and makes it difficult to detect any other cycles. The author uses the airline usage dataset, a widely used dataset in time series literature, to illustrate various concepts such as:
Detrending a time series using differencing
Use of OLS for detrending
Box-Ljung Q test to check the presence of correlation for a specific number of lags for the detrended data
Durbin-Watson test to check the autocorrelation of detrended data
Levene’s Test of Equality of Variances on detrended data to test stationarity (the function in the car package is deprecated; one can use levene.test from the lawstat package)
One of the basic criteria for going ahead with spectral analysis is the presence of variance in the data. If there is hardly any variance in the data, then fitting trend and spectral components is a stretch and adds little to the understanding of the data. So, that’s the first attribute in the data that makes a case for variance decomposition using spectral analysis.
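The two detrending approaches in the list above can be sketched in a few lines. This is an illustrative plain-Python version, not the book's procedure; the function names `ols_detrend` and `difference` and the toy series are assumptions made up for the example.

```python
def ols_detrend(x):
    """Return residuals from a least-squares straight line fitted against t = 0..N-1."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (x[t] - x_mean) for t in range(n))
    slope = sxy / sxx
    intercept = x_mean - slope * t_mean
    residuals = [x[t] - (intercept + slope * t) for t in range(n)]
    return residuals, slope

def difference(x):
    """First differences x[t] - x[t-1]; the result is one observation shorter."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

# Made-up series: linear trend plus an alternating +1/-1 component.
series = [5.0 + 0.5 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]

residuals, slope = ols_detrend(series)
diffs = difference(series)
print(round(slope, 3), len(diffs))
```

OLS residuals keep the original length and sum to zero, while differencing shortens the series by one point and changes the shape of any cycle left in it, so the choice between the two matters for the spectral step that follows.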
Chapter 4: Harmonic Analysis
This chapter talks about a situation where the data analyst is more or less certain about the periodicity of the signal and wants to estimate the amplitude, phase and mean for the wave. This chapter talks about the procedure to perform Harmonic analysis for detrended data. Harmonic analysis is a fancy name for multiple regression using sine and cosine variables as the independent variables. In Harmonic analysis, since you already know the period for the signal, you can easily set up a regression equation and find out the amplitude, the phase and the mean of the signal. The biggest assumption behind the procedure mentioned is the fact that you know the period of the signal. Sadly, it is almost never the case in analyzing real life data. There are certain issues one needs to handle before doing harmonic analysis; some of them are as follows:
Period T is not known a priori. Periodogram or spectral analysis needs to be done first
Spectral analysis might show several dominant frequencies, and one might have to take a call on the set of frequencies to select for harmonic analysis
In the presence of outliers, the spectral analysis can be misinterpreted. Transforming the data is imperative to avoid such errors
If the data are not “stationary”, then some characteristics of the time series will be changing over time like mean, variance, period or amplitude. In all such cases, simple harmonic analysis would lead to incorrect estimates.
The takeaway from this chapter is that harmonic analysis is useful if you are already aware of the period and you are certain that the series is stationary.
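Here is a minimal sketch of harmonic analysis for a known period, in plain Python. It relies on an assumption that should be stated loudly: the series is equally spaced and covers a whole number of cycles, in which case the least-squares coefficients for the cosine and sine regressors reduce to the simple sums used below. For any other layout a general regression routine would be needed. The function name `harmonic_fit` and the test signal are made up for illustration.

```python
import math

def harmonic_fit(x, period):
    """Fit x[t] = mean + amplitude * cos(2*pi*t/period + phase) for t = 0..N-1.

    Assumes len(x) is a whole multiple of `period` and period > 2, so that the
    constant, cosine and sine regressors are mutually orthogonal.
    """
    n = len(x)
    omega = 2 * math.pi / period
    mean = sum(x) / n
    a = 2 / n * sum(x[t] * math.cos(omega * t) for t in range(n))  # cosine coefficient
    b = 2 / n * sum(x[t] * math.sin(omega * t) for t in range(n))  # sine coefficient
    amplitude = math.hypot(a, b)
    phase = math.atan2(-b, a)
    return mean, amplitude, phase

# Made-up signal: mean 10, amplitude 3, phase 0.5 radians, period 7, four full cycles.
series = [10 + 3 * math.cos(2 * math.pi * t / 7 + 0.5) for t in range(28)]
mean, amplitude, phase = harmonic_fit(series, period=7)
print(mean, amplitude, phase)
```

If the assumed period is wrong, the same sums still return numbers, just not meaningful ones, which is why the period has to be established before this step.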
Chapter 5: Periodogram Analysis
Periodogram connotes a visual for some people and a model for others. Both have the same interpretation though. In the case of the visual, you plot the frequencies vs. intensity (mean square). In the context of a model, the null hypothesis is that the time series is white noise, meaning the proportion of variance represented by the various frequencies is nearly constant (you can use fisher.g.test in R to test for white noise in the spectrum).
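To make the white noise test concrete, here is a hedged plain-Python sketch of Fisher's g statistic: the largest periodogram ordinate divided by the sum of all ordinates, together with the exact null tail probability usually quoted for it. Whether the R function uses exactly this computation is an assumption on my part; the simulated series, the seed and the function names are made up for illustration.

```python
import cmath
import math
import random

def periodogram(x):
    """Ordinates |X_k|^2 / N for Fourier frequencies strictly between 0 and 1/2."""
    n = len(x)
    mean = sum(x) / n
    m = (n - 1) // 2
    ordinates = []
    for k in range(1, m + 1):
        coef = sum((x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        ordinates.append(abs(coef) ** 2 / n)
    return ordinates

def fisher_g(ordinates):
    """Return (g, p) where g = max ordinate / sum and p = P(G >= g) under white noise."""
    m = len(ordinates)
    g = max(ordinates) / sum(ordinates)
    terms = min(m, int(1 / g))
    p = sum((-1) ** (j - 1) * math.comb(m, j) * (1 - j * g) ** (m - 1)
            for j in range(1, terms + 1))
    return g, min(1.0, max(0.0, p))

random.seed(0)
n = 56
noise = [random.gauss(0, 0.3) for _ in range(n)]
cyclic = [math.sin(2 * math.pi * t / 7) + noise[t] for t in range(n)]

g_cyclic, p_cyclic = fisher_g(periodogram(cyclic))
g_noise, p_noise = fisher_g(periodogram(noise))
print(g_cyclic, p_cyclic, g_noise, p_noise)
```

A large g with a small p says that one frequency carries far more variance than white noise would allow; it tests only the single largest peak.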
The formal model for the periodogram is an extension of the harmonic analysis model. Instead of representing the time series as the sum of one periodic component plus noise, the periodogram model represents the time series as a sum of N/2 periodic components. One can use an optimization routine to get the estimates for the parameters of the model. For small N values it works well. But if you use an optimization routine to estimate the parameters of a model for large N, you soon realize that the coefficients are very unstable. What is the way out? One can turn this problem into Fourier series coefficient estimation. The Fourier frequencies are chosen such that the longest cycle fitted to the time series is equal to the length of the time series. The shortest cycle has a period of two observations. The frequencies in between are equally spaced and orthogonal to each other, that is, they account for non-overlapping parts of the variance of the time series. The FFT turns out to be an efficient way to estimate the parameters of the model.
So, to reiterate: let’s say you have a time series of length N = 2k + 1 and you want to fit k frequencies to it. You can write down a regression equation with k pairs of sine and cosine terms, where the j-th pair has coefficients Aj and Bj. You can regress these on the data, but the coefficients will be very unstable. Hence one needs to use the Fourier transform to work out these coefficients efficiently.
Firstly, before doing any periodogram analysis, you need to remove the trend component. One nice way of explaining the rationale behind removing the trend component is this: when the variance is partitioned in the periodogram, a trend shows up as longer cycles, i.e. lower frequencies, that account for larger shares of the variance in the time series. For example, if the series is of length 14 time units, then one can fit 7 frequencies with periods 14/1, 14/2, ..., 14/7. If you do not remove the trend, the lower frequencies will spuriously dominate the analysis.
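The link between the regression coefficients and the Fourier transform can be sketched directly. The block below is plain Python with a naive DFT (not an FFT, and not R's `fft`); the seven data values are made up, and the time index is assumed to run from 0 to N-1. For odd N, the mean plus the k sine and cosine pairs read off the DFT reproduce the series exactly.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform: X_k = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

x = [3.0, 5.0, 4.0, 8.0, 6.0, 2.0, 7.0]   # made-up series, N = 7 = 2k + 1 with k = 3
n = len(x)
k_max = (n - 1) // 2
X = dft(x)

mean = X[0].real / n
A = [2 * X[j].real / n for j in range(1, k_max + 1)]    # cosine coefficients
B = [-2 * X[j].imag / n for j in range(1, k_max + 1)]   # sine coefficients

# Rebuild the series from the mean and the k sine/cosine pairs.
fitted = [
    mean + sum(A[j - 1] * math.cos(2 * math.pi * j * t / n)
               + B[j - 1] * math.sin(2 * math.pi * j * t / n)
               for j in range(1, k_max + 1))
    for t in range(n)
]
max_error = max(abs(f - v) for f, v in zip(fitted, x))
print(max_error)
```

Because the Fourier frequencies are orthogonal, each pair of coefficients is obtained independently of the others, which is what a general-purpose optimizer fails to exploit.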
Consider this dataset mentioned in the book. It looks like it has a period of 7 days (units for t are days). If we are going to represent the series by an aggregation of sines and cosines, we need to calculate their amplitudes.
The above visual in the book can easily be reproduced by one line of R code
plot( (1:(N/2) ) / N, ( ( Mod( fft( x ) )^2 / N ) *( 4 / N ) )[2:8] , type = "l", ylab = "Intensity", xlab = "Frequency" ) # N=14
This terse line is easy to code, but it took me some time to understand the material well enough to write it. The x axis shows the frequencies from 1/14 to 7/14. Basically, if you are given 14 daily data points you can fit 7 different frequencies: 1/14, 2/14, 3/14, 4/14, 5/14, 6/14, 7/14. The above plot is called a periodogram. What’s a periodogram? If you take the initial dataset and fit sine and cosine waves at each of the above frequencies, you get a pair of coefficients, one for the sine and one for the cosine, for each frequency. Thus there are 7 pairs of coefficients for 7 frequencies. If you square those coefficients and add them up, you get one value for each frequency. The plot of frequency vs. squared amplitude is called the periodogram. If there are many frequencies to fit, computing the coefficients for each of them becomes cumbersome. Thanks to the FFT, one can get to the periodogram easily. How do you do that? R gives unnormalized Fourier coefficients: the sqrt(N) factor is left out, and a factor of exp(2*pi*i*k/N) for each frequency gets added to the fft coefficients. The missing sqrt(N) factor was easy for me to understand, but the additional factor appearing in the FFT coefficients confused me a lot till I understood it properly. In any case, since we are interested in the intensity here, all that matters is the correction for the square root of N. So, in the one line of R code that you see, the division by N is done to adjust for the factor that R leaves out. There is an additional 4/N factor that is applied to the intensity from the Fourier series. This 4/N is the magic factor that converts the intensity to periodogram values. Why should one multiply by 4/N? It takes a little bit of math to figure it out. Try it out on a rainy day.
The FFT is obviously one of the finest algos in mathematics. Having had no knowledge of the DFT, it took me some time to understand its various aspects. Someday I will write a post that summarizes my learnings from the DFT. The FFT is a super powerful algo. Having said that, different packages implement it in different ways. Hence it is always better to go over your statistical software manual to understand the output values of the algo. It took me some time to understand this in R. Thankfully NumPy has the same implementation as R for the FFT, i.e. the interpretation of the output is the same in R and NumPy.
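For readers who would rather not wait for a rainy day, here is a short sketch of where the 4/N comes from. It assumes the unnormalized DFT convention with the time index running from 0 to N-1, and it covers only the Fourier frequencies strictly between 0 and 1/2; the symbols A_k and B_k for the cosine and sine coefficients are my notation, not necessarily the book's.

```latex
% Unnormalized DFT of the series x_0, ..., x_{N-1}
X_k = \sum_{t=0}^{N-1} x_t \, e^{-2\pi i k t / N}

% Least-squares cosine and sine coefficients at frequency k/N, for 0 < k < N/2
A_k = \frac{2}{N} \sum_{t=0}^{N-1} x_t \cos\!\left(\frac{2\pi k t}{N}\right) = \frac{2}{N}\,\mathrm{Re}\,X_k,
\qquad
B_k = \frac{2}{N} \sum_{t=0}^{N-1} x_t \sin\!\left(\frac{2\pi k t}{N}\right) = -\frac{2}{N}\,\mathrm{Im}\,X_k

% Squared amplitude in terms of the intensity |X_k|^2 / N
A_k^2 + B_k^2 = \frac{4}{N^2}\,|X_k|^2 = \frac{|X_k|^2}{N} \cdot \frac{4}{N}
```

So dividing the squared modulus by N gives the intensity, and multiplying by 4/N turns it into the squared amplitude of the fitted wave. At the frequency 1/2 (k = N/2 for even N) the sine term vanishes and the cosine coefficient is X_k / N rather than 2 X_k / N, so the same factor overstates that last ordinate by a factor of 4.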
What are the limitations of periodogram analysis?
The major criticism of periodograms is that their ordinates are subject to a great deal of sampling error or variability. Because the number of periodogram ordinates (N/2) increases along with the N of the time series, merely using a larger N does not correct this problem. Spectral analysis refers to the family of statistical techniques used to improve the reliability of the periodogram. The power spectrum is a smoothed periodogram. What’s the disadvantage of spectrum analysis? Basically, the spectrum no longer corresponds to a partitioning of the variance.
The takeaways from this chapter are the following
Advantages of Periodogram over fitting single frequency
How to estimate the frequencies given the data?
How to use FFT to quickly estimate the frequencies?
How to estimate the regression coefficients of sine and cosine curves from FFT output?
How do you scale the intensity values from FFT to get Periodogram values?
Chapter 6: Spectral Analysis
Spectral analysis is a fancy term that means “smoothing” the periodogram. If you have worked on any smoothing procedures in statistics, you will immediately realize the connection here. Basically, you take a periodogram and smooth it. If you have studied and worked with density estimation in statistics, most of the methods can be carried over to spectral analysis. In the case of density estimation, you have frequency counts for various intervals of data and your intention is to create a smoothed density estimate. In the case of spectrum analysis, you have a set of intensities across various frequencies and your objective is to create a smoothed spectrum estimate. What are the advantages? The obvious one is the reduction in sampling error. What do you lose by estimating a continuous spectrum? If it is oversmoothed, you might lose the frequencies visible in the periodogram.
How do you smooth the periodogram? There are two things that you need to decide: 1) the width of the smoothing window and 2) the weights that will be applied to the neighboring observations. The combination of width and weights gives rise to various types of windows. It is up to the analyst to choose a good way to smooth the periodogram. There is an alternative way to get a spectrum estimate. You first calculate the autocorrelation function for M lags. You then do an FFT analysis on this lagged ACF. This method is called the Bartlett window. One of the advantages of this method is that you can select the frequencies that you want to use in the FFT.
The chapter also discusses confidence intervals around the spectrum estimate. The upper and lower bounds of a confidence interval can be computed using a chi-square distribution. In order to set up this confidence interval, it is necessary to determine the equivalent degrees of freedom (edf) for the spectral estimates. I was happy to see this concept of equivalent degrees of freedom coming up in this context. The first time I came across this concept was in relation to local regression. In a normal linear regression framework, degrees of freedom are important. But in the case of local regression, they become mightily important. It is a criterion for selecting a model from competing local regression models. Anyways coming back to this book…. Prior to smoothing, each periodogram intensity estimate has 2 degrees of freedom; after smoothing, the estimated new edf will vary depending upon the width and shape of the smoothing window. These confidence intervals are then used to test the null hypothesis of white noise.
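The two decisions, window width and weights, can be seen in a small sketch. This is plain Python, the periodogram values are made up, and the edge handling (renormalising the weights where the window runs off the end) is just one possible choice that I am assuming for the example, not something the book prescribes.

```python
def smooth_periodogram(ordinates, weights):
    """Weighted moving average of periodogram ordinates.

    `weights` must have odd length; near the edges only the weights that fall
    inside the series are used, rescaled so that they still sum to one.
    """
    half = len(weights) // 2
    smoothed = []
    for i in range(len(ordinates)):
        num = 0.0
        den = 0.0
        for j, w in enumerate(weights):
            idx = i + j - half
            if 0 <= idx < len(ordinates):
                num += w * ordinates[idx]
                den += w
        smoothed.append(num / den)
    return smoothed

raw = [0.2, 1.1, 0.4, 9.0, 0.7, 0.3, 1.5, 0.6]   # made-up periodogram ordinates

equal = smooth_periodogram(raw, [1 / 3, 1 / 3, 1 / 3])   # width 3, equal weights
tapered = smooth_periodogram(raw, [0.25, 0.5, 0.25])     # width 3, centre-heavy weights
print(equal[3], tapered[3])
```

The equal-weight window pulls the peak at index 3 down more than the centre-heavy one does, which is the trade-off between lower sampling error and blurring a real peak.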
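For reference, the chi-square interval usually takes the following textbook form. The notation is mine and whether the book writes it exactly this way is an assumption: ν is the equivalent degrees of freedom, Ŝ(f) the spectrum estimate at frequency f, and χ²_ν(p) the p-quantile of a chi-square distribution with ν degrees of freedom.

```latex
\frac{\nu \, \hat{S}(f)}{\chi^2_{\nu}(1 - \alpha/2)}
\;\le\; S(f) \;\le\;
\frac{\nu \, \hat{S}(f)}{\chi^2_{\nu}(\alpha/2)}
```

With ν = 2 for a raw periodogram ordinate the interval is very wide, and it narrows as smoothing raises ν.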
The takeaway from this chapter is
When to use the periodogram and when to smooth it? These are the two key decisions that a data analyst should make.
It is always better to go back to the time series, simulate data based on the frequencies shortlisted and check how the simulated signal matches the original signal.
If you have many time series for the variable that you are interested in, the periodogram might be a better choice as the sampling error can be reduced by studying various time series. The biggest advantage of the periodogram is that you get a sense of the variance contribution of different frequencies. You also control the Type I error for the white noise null hypothesis.
If you have a single time series, spectrum analysis might be preferred to guard against sampling error. However there are disadvantages, as the Type I error is inflated. You can no longer talk about partitioning of variance.
Chapter 7: Summary of Issues for Univariate Time-Series Data
This chapter talks about the various elements that should be a part of univariate spectral analysis report:
Clear description of time series data, the sampling frequency and the length of the time series
Preliminary data screening (normality checks, transformations, etc.)
Trend analysis performed if any
Clear specification of the set of frequencies that are included in the periodogram or spectrum
If periodogram is displayed, clear explanation of the intensity or whatever that is plotted on the y axis needs to be provided
Large values in periodogram or Spectrum analysis should be followed up with statistical significance tests.
List the major or most prominent periodic components. For each of the component, it is appropriate to report the proportion of variance explained , estimated amplitude, and the statistical significance of each component
Better to show the fitted time series on the original time series
One of the common problems facing a data analyst is whether to aggregate various time series and then perform a spectral analysis, or to perform spectral analysis on the individual series and then aggregate the result for each frequency. The best way to proceed is to perform spectral analysis on each of the series individually to ascertain any similarities or differences between the spectra. Depending on the context of the problem, one might be interested in focusing on the differences or the similarities. If similarity is the focus of the study, one can always aggregate the series for a certain period length and then do a spectral analysis on this aggregated time series.
There is a final section in this chapter that deals with changing parameters. Real life data is noisy and usually stochastic in nature. In all such cases the simple assumption that the periodogram is valid for the entire interval of data is too naïve. Three methods are described in this chapter. The first is the most basic version, where the time series is split into chunks and the parameters are analyzed to see any obvious breaks in the trend. The two other methods discussed in the section, complex demodulation and band-pass filtering, I could only understand intuitively. I need to go over these methods in some other book to make my understanding more concrete.
Chapter 8: Assessing Relationships between Two Time Series
This chapter deals with various methods that are used to analyze relationships between two time series. The most common statistic reported to capture the joint behavior is the correlation estimate. However there are many drawbacks to taking this statistic at face value. What are they?
The observations in the time series data are not independent, so the estimate is biased
Most financial time series exhibit serial dependence. The possible existence of time-lagged dependency within each time series should be taken into account
The time series could be strongly correlated at a certain lag, and the ordinary (lag zero) correlation might not be able to catch it.
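The last drawback is easy to demonstrate. In this plain-Python sketch the two series are made up: the second is the first delayed by 3 time units, a quarter of its 12-unit cycle. The function names are assumptions for the example, and no prewhitening is attempted here.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag], for lag >= 0."""
    return correlation(x[:len(x) - lag], y[lag:])

x = [math.sin(2 * math.pi * t / 12) for t in range(48)]
y = [math.sin(2 * math.pi * (t - 3) / 12) for t in range(48)]   # x delayed by 3

r0 = lagged_correlation(x, y, 0)
by_lag = {lag: lagged_correlation(x, y, lag) for lag in range(7)}
best_lag = max(by_lag, key=by_lag.get)
print(r0, best_lag, by_lag[best_lag])
```

The lag zero correlation is essentially zero even though one series is an exact delayed copy of the other; scanning the lags recovers the delay.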
In the standard econometric analysis, there is a procedure called “prewhitening” that is done by the analyst before studying correlation and cross correlation. Prewhitening involves removing trend, cyclical, serial dependence etc. from each of the time series and studying the correlation between the residuals from the two time series. The rationale behind this is that all the factors that artificially inflate correlation are removed. So, the correlation estimate that finally gets computed captures the moment-to-moment variations in the behavior. Despite prewhitening, the residuals could still show a correlation because of some environmental factor. So the analysis is not perfectly air-tight at all times.
The book suggests a departure from the above standard methodology. It takes the view that prewhitening kills a whole lot of information, and that an analysis which reports on shared trends and cycles might be of great interest to the data analyst. Hence the book suggests that data analysts should report the variance contributions of the trend and cyclical components, besides doing analysis on the residuals such as lagged cross correlation estimates. By looking at periodogram output, the data analyst can ask the following questions:
Questions like these help in better understanding the relationship amongst the time series.
The takeaway from this chapter is that sometimes shared trends and cycles of the time series contain crucial information. In all such contexts, tools from spectral analysis such as Cross-spectral analysis will be immensely useful.
Chapter 9: Cross-Spectral Analysis
A cross-spectrum provides information about the relationship between a pair of time series. The variance of a single time series can be partitioned into N/2 frequency components in univariate spectral analysis. In cross-spectral analysis, we obtain information about the relationship between the pair of time series separately for each of the N/2 frequency bands.
What does the Cross-Spectral analysis address?
What proportion of the variance in each of the two individual time series is accounted for by this frequency band?
Within this frequency band, how highly correlated are the pair of time series?
Within each frequency band, what is the phase relationship or time lag between the time series?
General Strategy for Cross-spectral analysis?
Careful univariate analysis of the individual time-series needs to be done. Are there trends in one or both of the time series? Are there cyclical components in one or both the time series?
Prewhiten each of the individual time series. Remove the trend component before doing cross-spectrum analysis. Do not treat cycles as a spurious source of correlation between the two time series. Cycles are treated as important components of the time series analysis.
Use the residuals in cross-spectral analysis
The above strategy produces the following estimates (for each frequency f):
The percentage of variance accounted for by f in time series X
The percentage variance accounted for by f in time series Y
The squared coherence between X and Y at f
The phase relationship between X and Y at f
Now this is a lot of data and one needs some guideline to work with this massive dataset. The rule of thumb is to proceed from the univariate spectra to the coherence data to the phase data. Just because coherence numbers are high does not mean that we have stumbled onto something important. What’s the point if the coherence is high for a frequency f that explains very little of the variance in the individual spectra? Similarly the phase numbers make sense only if there is high coherence.
What are the conditions under which we can conclude that we have evidence of synchronized cycles between two time series? The book gives a good list of the conditions.
A high percentage of the power of time series X is contained in a narrow frequency band
A high percentage of the power of time series Y is contained in this same narrow frequency band
There is a high coherence between X and Y in this same frequency band
The phase relationship between the cycles in X and Y can then be estimated by looking at the phase for this frequency band.
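The last step, reading a time lag off the phase, can be sketched with two noise-free series. Everything here is an illustrative assumption: the series are made up, the DFT is a naive one in plain Python, and the sign convention (cross-spectrum taken as X times the conjugate of Y, so that a positive phase means Y lags X) is my choice. Squared coherence is deliberately left out, because computed from raw unsmoothed ordinates it is always 1 and only becomes informative after smoothing.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform: X_k = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

n, k0, delay = 48, 4, 2   # 48 points, cycle of period 48/4 = 12, Y lags X by 2
x = [math.cos(2 * math.pi * k0 * t / n) for t in range(n)]
y = [0.5 * math.cos(2 * math.pi * k0 * (t - delay) / n) for t in range(n)]

X, Y = dft(x), dft(y)
cross = [X[k] * Y[k].conjugate() for k in range(n)]   # raw cross-periodogram

# Frequency index (strictly between 0 and N/2) where the shared power is largest.
k_peak = max(range(1, n // 2), key=lambda k: abs(cross[k]))
phase = cmath.phase(cross[k_peak])                    # radians
lag = phase / (2 * math.pi * k_peak / n)              # converted to time units
print(k_peak, phase, lag)
```

With real data the phase is only worth converting to a lag at frequencies that pass the first three conditions in the list above.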
The actual computation of the cross-spectrum numbers is only mentioned at a 10,000 ft level, and a book by Bloomfield is given as the reference. I should find time to go over this book by Bloomfield, “Fourier Analysis of Time Series: An Introduction”.
The basic takeaway from this chapter is: If the squared coherence is not significantly different from 0, then it means that the two time series are not linearly related at any frequency or any time lag. If there are high coherences at some frequencies, then this may indicate a time-lagged dependence between the two time series; the length of the time lag can be inferred from the phase spectrum.
Chapter 10: Applications of Bivariate Time-Series and Cross-Spectral Analyses
This chapter presents two examples that illustrate the various concepts discussed in Chapter 8 and Chapter 9. The first example deals with prewhitening two time series and doing cross correlation lag analysis. The analysis is logically extended to form a linear regression model between the residuals of the time series. The second example deals with cross-spectrum analysis and highlights the concepts of coherence and phase mentioned in Chapter 9. One problem with the examples in this chapter is that you can only read about them. The data is not available as a csv or some online file; it is all part of the appendix. Typing 450 data points into a csv to replicate the author’s analysis is a stretch. In any case the analysis is presented in such a way that you get the point.
Chapter 11: Pitfalls for the Unwary: Examples of Common Sources of Artifact
This chapter is one of the most important chapters in the entire book. It talks about the common pitfalls in interpreting the spectral analysis output.
It is easy to produce periodogram or intensity plot for each frequency, thanks to powerful FFT algos in statistical software packages. Given this situation, it is easy to make a mistake in interpretation.
Testing the Null Hypothesis – No significant frequencies across the Periodogram
For simulated white noise, none of the individual frequencies accounts for a statistically significant share of the variance (Fisher test – 0.29)
Problems in interpretation if the data is not detrended
The periodogram shows a statistically significant low-frequency component. This example clearly shows the effect of not using detrended data for spectral analysis. A strong low frequency dominates everything.
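The effect of a trend on the periodogram can be reproduced with a made-up series: a straight line plus a cycle of period 8. This is a plain-Python sketch with a naive DFT and a simple least-squares line; the function names and numbers are assumptions for illustration, not the book's example.

```python
import cmath
import math

def periodogram(x):
    """Mean-removed ordinates |X_k|^2 / N for k = 1 .. N // 2."""
    n = len(x)
    mean = sum(x) / n
    return [abs(sum((x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(1, n // 2 + 1)]

def ols_detrend(x):
    """Residuals from a least-squares straight line fitted against t = 0..N-1."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    slope = (sum((t - t_mean) * (x[t] - x_mean) for t in range(n))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [x[t] - x_mean - slope * (t - t_mean) for t in range(n)]

def peak_index(ordinates):
    """Frequency index k (1-based) of the largest ordinate."""
    return 1 + max(range(len(ordinates)), key=lambda i: ordinates[i])

n = 32
series = [0.5 * t + math.sin(2 * math.pi * t / 8) for t in range(n)]   # trend + cycle

peak_raw = peak_index(periodogram(series))
peak_detrended = peak_index(periodogram(ols_detrend(series)))
print(peak_raw, peak_detrended)
```

Before detrending the largest ordinate sits at the lowest frequency (k = 1, one cycle per series length); after detrending it moves to k = 4, i.e. period 32/4 = 8, which is the cycle that was actually put in.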
Erroneous inclusion of baseline periods
In this case, erroneous inclusion of baseline periods before and after the actual experimental data produces spurious peaks in the periodogram.
The above visuals show the effect of undersampling, i.e. the interval length is not an integer multiple of the underlying period. The top two graphs are for the data with the correct sampling, whereas the bottom two graphs are the result of undersampling. Clearly the periodogram (bottom-right) shows more peaks than those present in the true data (top-right).
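The extra peaks that appear when the series length is not a whole number of cycles (the effect usually called leakage) can be reproduced with a pure sine wave. The sketch below is plain Python with made-up lengths: 28 points cover exactly four cycles of period 7, while 32 points do not. The 1% threshold used to count "visible" ordinates is an arbitrary choice for the example.

```python
import cmath
import math

def power_shares(x):
    """Share of total periodogram power at each Fourier frequency in (0, 1/2)."""
    n = len(x)
    mean = sum(x) / n
    power = [abs(sum((x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
                     for t in range(n))) ** 2
             for k in range(1, (n + 1) // 2)]
    total = sum(power)
    return [p / total for p in power]

def count_visible(shares, threshold=0.01):
    """Number of ordinates holding more than `threshold` of the power."""
    return sum(1 for s in shares if s > threshold)

whole_cycles = [math.sin(2 * math.pi * t / 7) for t in range(28)]     # 4 full cycles
partial_cycles = [math.sin(2 * math.pi * t / 7) for t in range(32)]   # about 4.57 cycles

count_whole = count_visible(power_shares(whole_cycles))
count_partial = count_visible(power_shares(partial_cycles))
print(count_whole, count_partial)
```

With 28 points all the power lands on a single ordinate; with 32 points the same sine wave spreads over several neighbouring ordinates, even though nothing about the underlying signal has changed.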
There is also an example of outliers spoiling the periodogram show. Overall, a very interesting chapter in the book. It warns the data analyst against misreading and misinterpreting the periodogram data.
The basic takeaway of the chapter is: it is easy to misinterpret periodogram output because of the presence of trend, outliers, boxcar waveforms, leakage, etc. Out of all the culprits, aliasing is the most difficult to identify and work around. Typically one needs to oversample to be sure that the periodogram represents the correct output. Sampling frequency and time-series length become crucial in estimating the true frequencies underlying the data.
Chapter 12: Theoretical Issues
The final chapter of the book deals with some of the issues that any data analyst would face in using methods described in the book, i.e.
Is the underlying process deterministic or stochastic in nature?
Is there a need to presume cyclicity in the data?
How to choose between doing a time domain analysis Vs frequency domain analysis?
The book gives the reader an intuitive understanding of Spectral analysis. One usually is exposed to time domain analysis in some form or the other. Looking at data in frequency domain is like having a new pair of eyes. Why not have them? The more eyes you have, the more “fox-like” skills (Hedgehog Vs. Fox) you develop in understanding data. With minimal math and equations, the book does a great job of motivating any reader to learn and apply spectral analysis tools and techniques.
July 29, 2012
Posted by safeisrisky under Books
The author begins the book with a slew of examples involving predictions that never materialized. These examples span a wide range of fields like economics, social sciences, finance, politics, etc. The author also sneaks in examples from his parents’ and grandparents’ lives to show how their lives panned out in ways that were completely unpredictable. Well, do these examples prove anything? You can quote volumes of predictions going wrong, but if you ask the people who made them, they always seem to have a defense. What are the common answers that experts give when asked about the failed predictions?
“I was almost right” - It would have happened if I had not been blindsided by this event
“Self-negating prophecy” - I predicted the right thing but since there was a massive remediation in response to the prediction, it never came true (example – guys who predicted that Y2K would spell disaster)
“Wait and See Twist” – It has not happened as yet but it will soon happen
Defense that involves carefully parsing the forecast and saying that original prediction meant something else and was far more elastic in intent.
So, even if one lists a set of predictions that went wrong, it is rather difficult to rigorously state that the predictions went wrong. Again, whoever quotes a set of predictions going wrong is accused of cherry picking specific predictions. What’s needed is the "rate of failure" to understand whether expert predictions work. But to compute the rate of failure, we need hits too. How do you identify hits? What if an expert has a poor failure rate but hits the bull’s eye for the event that we picked? How does one tell a dart thrower from an expert giving the right prediction? So, even though it is easy to identify specific instances where predictions have gone wrong, if we take logic and evidence into consideration, "figuring out how good experts are at predicting the future" is a challenging question. One might have to dedicate one’s whole life to carrying out such an experiment and seeing what results it throws up. One man has done this feat. Philip Tetlock, a professor at Haas School, conducted one such experiment spanning over 2 decades, collating about 27,450 predictions and their outcomes by interviewing and collecting prediction statements from a diverse set of experts in various fields. It is possibly one of the most scientific experiments conducted to answer the question of the experts’ rate of failure. The result of Tetlock’s experiment was that expert guesses were as good as random guesses, i.e. a dart throwing monkey. He also noticed two other things. There were experts who did worse than a coin toss. The other interesting thing he noticed was that the experts who had a lower rate of failure in their predictions had one thing in common. They had a different style of thinking.
The experts who were more accurate than others tended to be much less confident that they were right.
This meant that they had no prior template into which to fit world events. They were self-critical about their own opinions as well as those of others. Basically, they predicted with a healthy dose of skepticism towards any prediction. This book tries to discuss this specific style of thinking using the analogy of Hedgehog vs Fox. The sum and substance of Tetlock’s study was "Foxes beat Hedgehogs".
The unpredictable world
The author says that predictions work extremely well when the system governing the response variable and its predictors follows a linear model. The planetary system, the occurrence of tides, etc. are all systems that are linear in nature, i.e. the equation is in some linear form and can be estimated with surprising accuracy. So we are in a position to say exactly when a solar eclipse will occur at a certain location on earth, but are terribly poor at predicting demographic, political and social events. Why? The entire system is non-linear and feedback dependent. Slight changes to the input data create great model uncertainty. The author makes a strong argument against modeling any non-linear system, i.e. pretty much every forecast that we see in the media.
This chapter concentrates on two areas of prediction: demographic predictions and oil price predictions. Through a series of predictions, a case is made that non-linear systems are dependent on "monkey bite" moments, a phrase that means that trivial events can have unimaginable consequences. The actual phrase was used by Winston Churchill when he wrote that a monkey bite caused a war between Greece and Turkey. It’s the butterfly effect. The chapter ends with a clear message: the price of oil is fundamentally unpredictable, and so are demographic trends. However, both fields employ tons of experts to dish out opinions and reports. Why? The answer is simple: there is a demand for forecasts, i.e. the "Seer-Sucker theory" (no matter how much evidence exists that seers do not exist, suckers will pay for the existence of seers).
In the minds of experts
This chapter talks about the human brain and why it is hardwired to see patterns even when there aren’t any. It cites the example of Arnold Toynbee, an expert who became insanely famous predicting stuff and was subsequently criticized by many when his predictions failed to materialize. Centuries of human brain development have equipped us to see patterns where they exist, because failing to spot a real pattern can be a matter of life and death. However, evolution has put no comparable penalty on the opposite mistake, the Type I error of seeing a pattern when there isn’t one. So the human brain basically has no intuitive sense of randomness. Hence we come across people who keep on predicting stuff despite knowing that the underlying process is random.
What about experts? What’s going on in the minds of experts? The author uses Philip Tetlock’s conclusion that experts in a specific field had a greater rate of failure than people outside the field, i.e. hedgehogs fare badly compared to foxes. As the saying goes, "Theory is the root of all Evil": hedgehogs are blinded by confirmation bias. As they amass more and more knowledge, they are blinded by their own subconscious beliefs and start picking and selecting information that suits their overall theory. In a classic book on volatility by Riccardo Rebonato, I came across a similar analogy. Option valuation can be done in two ways: one is an option-replication argument (the "hedgehog approach"), the other is a "fox approach" where you don’t stick to one theory and incorporate a healthy dose of randomness in valuing and trading options. People who stick to the option-replication argument are typically seen in academia spinning yarns (theories), all basically resting on this one hedgehog theory of option replication. In fact, Taleb in one of his books says that pricing and trading anything beyond plain vanilla European options is very complicated and error prone, and no amount of math can come to our rescue.
The takeaway from this chapter is that Foxes are right about a lot of things. Hedgehogs are mostly wrong. They are deluded despite their brilliance. In fact, they are deluded because of their brilliance.
Books like these made me wonder whether there is money to be made in long-only investing. As they say, Stocks Are Stories, Bonds Are Mathematics. In my work environment, I see people around me who have invested their lives in stories (stocks) and have chosen "long only stock picking" as a career. They religiously weave stories in their minds by following stock-specific information and talk about them with their tribe. With an increasingly non-linear world around us, where "monkey-bite" moments have far-reaching consequences and where information overload is leading to more and more confirmation bias, it looks like hedgehogs are going to lose out to foxes.
The experts agree : expect much more of the same
This chapter talks about findings from behavioral economics and gives examples of status quo bias, anchoring-and-adjustment bias, the availability heuristic and the representativeness heuristic. The rationale for mentioning these biases is that the author wants to make the point that neither hedgehogs nor foxes are always good at getting over them. He also debunks "scenario planning" as another bogus exercise: if the facts are taken at face value, umpteen scenario analysis sessions at various MNCs and other firms around the world have hardly given rise to any meaningful correct predictions. The problem with scenario planning, as explained in this chapter, is that foxes tend to run to the other extreme of the hedgehog’s stance. The exercise is mainly meant to shake up people who are carried away by status quo bias. However, by going to the other extreme, they are often blinded by the representativeness heuristic.
Unsettled by uncertainty
This section mentions that every one of us is unsettled by uncertainty, and that it is hard to live with and accept uncertainty. Hence the craving for experts and their predictions. Also, for someone who has already been a hedgehog for a number of years, it is difficult to cross over and become a fox, as he/she senses a financial and psychological cost to it. I came to know from this book that Robert Shiller, who is known to have predicted the real estate crisis, hardly did anything about it in the meetings and committees that he was part of. In his own words, there was a great disincentive to voicing his opinions more aggressively. Basically, that means he was trying to fit in and align with his personal incentives. Nothing wrong with it. Maybe that’s the way the world is. The point to be noted is that status quo bias is tough to fight.
Everyone loves a hedgehog
The content in this chapter revolves around this behavior of most of us: "We tend to forget misses and focus on hits". All the experts who have failed in their predictions are forgiven easily. The TV channels, the columnists, and many more people do not criticize them. Instead they latch on to the hits and do not even mention the past failures in prediction. It’s like
Heads : I win , Tails : You forget that we had a bet.
This, coupled with hedgehogs who exhibit oodles of confidence in their predictions, makes us all suckers for their statements. We give in. The chapter is interesting as it cites many examples like Peter Schiff and Robert Shiller (who were hailed as people who predicted the crisis correctly) and shows that in each of the cases, they really did not make the prediction in the way the media has projected it. In a way, it seems to say that we are suckers for certainty and there are hedgehogs who deliver it. Looks like a perfect demand-supply equation. Who cares whether the predictions went right or wrong? Maybe the poor investor in the fund whose manager made a call after listening to hedgehogs.
When prophets fail
This chapter talks about cognitive dissonance and its role in an expert’s, or a hedgehog’s, life. Cognitive dissonance, in modern psychology, means rationalizing events to suit one’s philosophy and thinking. When hedgehogs find that their predictions have not come true, cognitive dissonance comes into play and they rationalize their failure at prediction. The two most common ways to duck the failure are 1) time is still ticking, i.e. they only got it wrong on the time scale, and 2) they misremember things to suit their needs (hindsight bias). The author cites a few examples of cognitive dissonance in play. He goes on to say that not only hedgehogs but also the fans of hedgehogs can undergo cognitive dissonance and conveniently misremember stuff.
The end is nigh
The last chapter praises fox thinking and says that there are three components to the Fox mode of thinking:
Aggregation: By aggregating opinions from a diverse range of sources, one is better prepared to acknowledge the uncertainty of a specific prediction
Metacognition: Constantly think about how you think
Humility: Have a healthy dose of skepticism about any forecast that you make. In case any of your forecasts is a hit, be humble enough to accept that chance played a role (George Soros constantly does that)
By incorporating these components into our everyday thinking, it is likely that we will be better prepared to face the future.
The book is an accessible introduction to Philip Tetlock’s research. The author puts a spin on one of Tetlock’s conclusions, i.e. that there are two types of experts, Hedgehogs and Foxes. Hedgehogs believe in one big theory whereas Foxes have a more elastic kind of thinking. The takeaway from this book is that developing a Fox style of thinking is a better alternative in this non-linear, unpredictable world. The book does not criticize experts (Hedgehogs) but gives an explanation of a) why they dish out opinions with such certainty and b) why we constantly seek out such experts. After reading this book, I guess any reader will take "predictions" with a healthy dose of skepticism, if he/she is not already doing so.
July 22, 2012
The book is written by Julian Faraway, a Statistics Professor at the University of Bath. The book seems to be the culmination of lecture notes that the professor might have used over the years for teaching linear models. Whatever be the case, the book is wonderfully organized. If you already know the math behind linear models, the book does exactly what it promises in the title, i.e. it equips the reader to do linear modeling in R.
Let me attempt to summarize the book:
Chapter 1 – Introduction
The chapter starts off by emphasizing the importance of formulating the problem. Formulating the problem in statistical terms is the key aspect of any data analysis or modeling exercise. Once you formulate the problem, answering it is mostly a matter of mathematical or experimental skill. In that context, data cleaning is THE most important activity in modeling. In formulating the problem, it is always better to understand each predictor variable thoroughly. The author talks about some basic data description tools available in R. He also mentions kernel density estimation (KDE) as one of the tools. I was happy to see a reference to one of my favorite books on smoothing, a book by Simonoff, in the context of KDE. The chapter talks about the basic types of models that are covered in the various chapters, i.e. simple regression, multiple regression, ANOVA and ANCOVA.
Chapter 2 – Estimation
The chapter defines a linear model: the parameters enter linearly, while the predictors themselves can enter in any non-linear way. That the predictors need not be linear is the basic takeaway. The geometric representation of the linear model framework is illustrated below:
It is clear from the above visual that regression involves projection onto the column space of the predictors. If it is a good fit, the length of the residual vector should be small. To be precise, linear regression partitions the response into a systematic part and a random part. The systematic part is the projection of Y onto the subspace spanned by the columns of the design matrix X. The least squares formulae are quickly derived, i.e. the beta estimate and its variance, the fitted values and their variance, and the estimate of the error variance. The estimate of the residual variance is a formula that depends on the residual sum of squares and the residual degrees of freedom. To derive this little formula, one needs to know the connection between the trace of a matrix and the expected value of a quadratic form. There is a section on the Gauss-Markov theorem, which basically states that the least squares estimate of beta is the Best Linear Unbiased Estimate. The least squares estimate also matches the MLE if the errors are assumed to be iid normal.
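For reference, here is a sketch of the formulas this paragraph alludes to, written in standard textbook notation rather than copied from the book. The symbols are my assumptions: X is the n × p full-rank design matrix, H is the hat (projection) matrix, and the errors satisfy Var(ε) = σ²I.

```latex
\begin{align*}
\hat\beta &= (X^\top X)^{-1} X^\top y, &
\operatorname{Var}(\hat\beta) &= \sigma^2 (X^\top X)^{-1} \\
\hat y &= H y, \quad H = X (X^\top X)^{-1} X^\top, &
\operatorname{Var}(\hat y) &= \sigma^2 H \\
\hat\varepsilon &= (I - H)\, y = (I - H)\,\varepsilon, &
\mathrm{RSS} &= \hat\varepsilon^\top \hat\varepsilon = \varepsilon^\top (I - H)\, \varepsilon \\
E[\mathrm{RSS}] &= \sigma^2 \operatorname{tr}(I - H) = \sigma^2 (n - p), &
\hat\sigma^2 &= \frac{\mathrm{RSS}}{n - p}
\end{align*}
```

The trace step is where the "expected value of a quadratic form" comes in: for any fixed matrix A and errors with mean zero and variance σ²I, E[εᵀAε] = σ² tr(A), and tr(H) = p because H is a projection onto a p-dimensional subspace.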
There is also a subtle point mentioned that is missed by many newbies to regression, i.e. the coefficient of determination (R squared) has an obvious problem when the linear model has no intercept. The intuition behind R squared (1 - RSS/TSS) is this: suppose you want to predict Y. If you do not know X, then your best prediction is mean(Y), but the variability in this prediction is high. If you do know X, then your prediction will be given by the regression fit. This prediction will be less variable provided there is some relationship between X and Y. R squared is one minus the ratio of the sums of squares for these two predictions. Thus for perfect prediction the ratio will be zero and R squared will be one. However, this specific formula is meaningless if there is no intercept in the model, because the mean-only model is then no longer a special case of the fitted model.
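To see the no-intercept problem concretely, here is a minimal sketch in plain Python (not R, just to keep it dependency-free). The function names and the toy data are made up for illustration; the only formula used is 1 - RSS/TSS as described above.

```python
def fit_line(x, y, intercept=True):
    """Least-squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    if intercept:
        xbar, ybar = sum(x) / n, sum(y) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        b = sxy / sxx
        return ybar - b * xbar, b
    # Regression through the origin: slope = sum(x*y) / sum(x^2)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return 0.0, b


def r_squared(x, y, intercept=True):
    """1 - RSS/TSS, with TSS always measured around mean(y)."""
    a, b = fit_line(x, y, intercept)
    ybar = sum(y) / len(y)
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1 - rss / tss


# Toy data lying exactly on the line y = 9 + x
x = [1.0, 2.0, 3.0]
y = [10.0, 11.0, 12.0]

print("with intercept:   ", r_squared(x, y, intercept=True))
print("without intercept:", r_squared(x, y, intercept=False))
```

With the intercept the fit is perfect, while forcing the line through the origin makes the naive formula go negative, which is clearly not a "proportion of variance explained". As far as I know, R's summary() sidesteps this by using the uncentered total sum of squares when the model has no intercept, so its R squared values for the two kinds of model are not comparable.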
The chapter then talks about identifiability, a problem that typically arises when the model matrix is not of full rank, so that the cross product of the design matrix with itself is not invertible. One way to check for an identifiability problem is to compute the eigenvalues of that cross product. If any of the eigenvalues is 0 or close to 0, then you have a problem of identifiability and hence a problem of collinearity. A clear lack of identifiability is good in a way, as the software throws up errors or clear warnings. But a situation that is merely close to unidentifiability is a different problem: it is the responsibility of the analyst to interpret the standard errors of the model and take a call on the variables that have to be filtered away or dropped.
The best thing about this chapter is that it shows what exactly happens behind lm. I remember a conversation I had with my friend years ago about an interview question. My friend was talking about a favorite question that he always used to ask in interviews while recruiting fresh graduates: "Here is a dependent variable Z taking 4 values z1,z2,z3,z4. Here are two independent variables X and Y taking values (x1,x2,x3,x4) and (y1,y2,y3,y4). If you fit a linear model, give me the estimates of the predictor coefficients and their variances." Starting from very basic values for which the linear model works well, he used to progressively raise the bar with different values of Z, X, Y. For people who are used to software tools doing the computation, such questions are challenging. Well, one does not need to remember stuff, is the usual response. OK, but there is always an edge for a person who knows what goes on behind the tools. This chapter is valuable in that sense, as it actually shows the calculation of the various estimates and their variances using plain simple operations and then matches these values to what lm() throws as output. I guess one simple test of whether you know linear regression well is as follows:
lm(y~x,data) throws the following output :
[coefficients ,residuals, effects ,rank , fitted.values, assign, qr, df.residual, xlevels, call, terms, model ].
Can you manually compute each of the above output using plain simple matrix operations and arithmetic operations in R?
summary(lm(y~x, data)) throws the following output :
[call, terms, residuals, coefficients, aliased, sigma, df, r.squared, adj.r.squared, fstatistic, cov.unscaled]
Can you manually compute each of the above output using plain simple matrix operations and arithmetic operations in R?
Well, one can obviously shrug off the above questions as useless. Why do I need to know anything when lm() gives everything? Well, it depends on your curiosity to know things. That’s all I can say.
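In the spirit of the interview question, here is a minimal sketch of the "by hand" computation for two predictors and four observations. It is in plain Python rather than R so that nothing is hidden behind a library; the helper names (transpose, matmul, inverse, ols) and the numbers are my own, and the dictionary keys only loosely mirror the lm()/summary() component names.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(a)
    m = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("X'X is singular: the model is not identifiable")
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def ols(X, y):
    """X: list of rows (with a leading 1 for the intercept); y: list of responses."""
    n, p = len(X), len(X[0])
    Xt = transpose(X)
    xtx_inv = inverse(matmul(Xt, X))                      # "cov.unscaled"
    beta = [r[0] for r in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    rss = sum(e * e for e in resid)
    sigma2 = rss / (n - p)                                # RSS / residual df
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    return {
        "coefficients": beta,
        "std_errors": [(sigma2 * xtx_inv[j][j]) ** 0.5 for j in range(p)],
        "fitted": fitted,
        "residuals": resid,
        "df_residual": n - p,
        "sigma": sigma2 ** 0.5,
        "r_squared": 1 - rss / tss,
    }


# Hypothetical interview data: Z regressed on X and Y
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
z = [9.1, 7.9, 19.2, 17.8]
X = [[1.0, xi, yi] for xi, yi in zip(x, y)]
fit = ols(X, z)
for name in ("coefficients", "std_errors", "sigma", "r_squared"):
    print(name, fit[name])
```

Inverting X'X directly is the textbook route and is fine for a four-row example; lm() itself works through a QR decomposition (hence the qr component in its output), which is numerically safer.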
Chapter 3 – Inference
The chapter starts off with an important framework for testing models, i.e. splitting the sum of squares of a full model into orthogonal components. Let me attempt to verbalize the framework. Let’s say you come up with some reduced model and you want to check whether it is adequate compared with the full model. The residual sum of squares (RSS) of the small model is compared with that of the large model. If the difference between the two is small, then the reduced model is preferred. To standardize this number, it is divided by the RSS of the large model.
If this ratio is small, then there is a case for the reduced model. This ratio, (RSS_red – RSS_full) / RSS_full (after adjusting for degrees of freedom), is the same test statistic that arises out of the likelihood-ratio testing framework. The best thing about this framework is that you don’t need to remember any crazy formulae for various situations. If you get the intuition behind the framework, then you can apply it to any situation. For example, if you want to test that all the coefficients are 0, i.e. the reduced model contains only the intercept term, then all you have to do is calculate [(RSS_Null – RSS_Model)/(p-1)] / [RSS_Model/(n-p)] and compare it with the F critical value. If it is greater than F critical, you reject the null and go with the model. In fact, this output is given as a part of summary(fit) in R. The good thing about this framework is that you can also test the significance of a single predictor with it. The reduced model would be the full model minus the specific predictor. This is compared to the full model, and the computed F value can be used to judge the statistical significance of the predictor. The chapter mentions the I() and offset() functions in the context of testing specific hypotheses in R.
The chapter then talks about forming confidence intervals for the various betas in the regression framework. The good thing about this text is that it shows how to compute them from scratch. Once you know those fundamentals, you will not hesitate to use functions like confint in R. Blindly using confint would be difficult, at least for me.
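The overall F test described above can be sketched for the simplest case, a one-predictor model against the intercept-only model. This is plain Python with made-up data and my own function names; n is the number of observations and p the number of parameters in the full model (here 2).

```python
def rss_null(y):
    """RSS of the intercept-only model, i.e. the total sum of squares."""
    ybar = sum(y) / len(y)
    return sum((yi - ybar) ** 2 for yi in y)


def rss_simple(x, y):
    """RSS of the model y ~ 1 + x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """((RSS_red - RSS_full) / (df_red - df_full)) / (RSS_full / df_full)."""
    return ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n, p = len(y), 2

# Residual degrees of freedom: n - 1 for the null model, n - p for the full model
F = f_statistic(rss_null(y), rss_simple(x, y), n - 1, n - p)
print("F =", F, "on", p - 1, "and", n - p, "degrees of freedom")
```

The value would then be compared with the F critical value, which in R comes from qf(0.95, p - 1, n - p). The same f_statistic function works for any pair of nested models as long as the two RSS values and their residual degrees of freedom are supplied.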
Obviously, once you have the model and the confidence interval for each of its parameters, the next thing you would be interested in is prediction. There are two kinds of prediction, and one needs to go over this stuff to understand the subtle difference. In R, you can use the predict function with the argument interval = "confidence" to get an interval for the mean of the response variable at a specific value of the independent variable. You can use predict with interval = "prediction" to get an interval for a new observation of the response at that value. The second interval is always wider, because it has to account for the error of the new observation on top of the uncertainty in the estimated mean.
There are many problems in applying the regression framework to observational data, such as the lurking variable problem and the multicollinearity problem. The section makes it extremely clear that non-statistical knowledge is needed to make a model robust. The section also makes the distinction between practical significance and statistical significance:
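Written out, the two intervals differ by a single term under the square root. This is the standard textbook form, not a transcription from the book; x_0 is the new predictor vector, and the t quantile is taken on the n - p residual degrees of freedom.

```latex
\begin{align*}
\text{mean response:} \quad
  & x_0^\top \hat\beta \;\pm\; t_{n-p}^{(\alpha/2)} \,\hat\sigma
    \sqrt{x_0^\top (X^\top X)^{-1} x_0} \\
\text{new observation:} \quad
  & x_0^\top \hat\beta \;\pm\; t_{n-p}^{(\alpha/2)} \,\hat\sigma
    \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
\end{align*}
```

The extra 1 is the variance of the new observation's own error, which does not shrink no matter how much data went into the fit.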
Statistical significance is not equivalent to practical significance. The larger the sample, the smaller your p-values will be, so do not confuse p-values with an important predictor effect. With large datasets it will be very easy to get statistically significant results, but the actual effects may be unimportant.
Chapter 4 – Diagnostics
The chapter starts off by talking about three kinds of problems that might arise in a linear regression framework.
The first type of problem arises when the assumption that the errors are iid with constant variance fails. How does one check this after fitting a model? One way is to draw a residuals vs fitted values plot, which graphically shows non-constant variance, if any. If the variance is not constant, one can look at the square root, the log or some other transformation of the dependent variable so that the residual plot looks like random noise. Also, a simple qqnorm of the residuals gives an idea of whether the residuals follow a Gaussian distribution or not. Sometimes one uses studentized residuals for the normality check. It is always better to check these things visually before doing a statistical test like shapiro.test. The other problem that can arise with residuals is autocorrelation. One can use the Durbin-Watson test to test for autocorrelation amongst the residuals of a linear model.
The second type of problem arises because of unusual observations in the data. What are these unusual observations? Any observation that has a great say in the beta values of the model can be termed an unusual observation. One must remember that even though the error term is assumed iid, the variance of the fitted values (and of the residuals) depends on the hat matrix and the error sigma. So if the hat value is high for a specific observation, the fitted value is pulled very close to the observed value, and the fit might be incorrect from an overall data perspective. Such points are identified by their “leverage”. Leverage depends on the spread of the independent variables and not on the dependent variable. In R, hatvalues() on a fitted model gives the hat values. One can always use half-normal plots to spot high-leverage observations.
The other type of unusual observation is the outlier. The book mentions the use of the cross-validation principle to diagnose outliers. To apply the cross-validation principle, one would actually need to regress n times, n being the number of data points. However, a nifty calculation shows that a variation on studentized residuals, i.e. externally studentized residuals, gives an equivalent measure of whether an observation is an outlier or not.
An influential point is one whose removal from the dataset would cause a large change in the fit. An influential point may or may not be an outlier and may or may not have large leverage. Cook’s distance is used to compute the influence of a specific point on the slope and intercept of the model.
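The Durbin-Watson statistic itself is a one-liner: the sum of squared successive differences of the residuals divided by the sum of squared residuals. A minimal sketch in plain Python, with toy residual vectors I made up to show the two extremes:

```python
def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


# Residuals that flip sign every step (negative autocorrelation)
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
# Residuals that come in runs of the same sign (positive autocorrelation)
print(durbin_watson([1.0, 1.0, -1.0, -1.0]))
```

Values near 2 suggest no first-order autocorrelation, values towards 0 suggest positive autocorrelation and values towards 4 negative autocorrelation. The statistic alone is not the test; the p-value needs the null distribution, which is what the packaged test functions in R take care of.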
So, all in all, it is better to produce 4 visuals when diagnosing a linear regression model: externally studentized residuals, internally studentized residuals, Cook’s distance and leverage. Any unusual observations are likely to be identified with these visuals. I think one way to remember these 4 checks is to give them a fancy name, “The Four Horsemen of Linear Model Diagnostics”.
The third type of problem arises when the assumed structural equation itself is wrong. The chapter talks about two specific tools to help an analyst in this situation: a) the partial residual plot (better for detecting non-linearity) and b) the added variable plot (better for detecting outliers and influential observations). The logic behind the added variable plot is this: let Xi be the predictor of interest. Regress Y on all the other Xs and compute the residuals. Regress Xi on the other Xs and compute the residuals. Plot the two sets of residuals against each other to check for influential observations. The logic behind the partial residual plot is this: plot Xi against the response with the fitted contribution of the other Xs removed, i.e. the residuals plus the fitted term for Xi. prplot in the faraway package does this automatically.
One of the best functions for getting all the relevant diagnostic values of a fitted model is ls.diag. It gives hat values, standardized residuals, internally studentized residuals, externally studentized residuals, Cook’s distance, correlation, scaled covariance, unscaled covariance and dfits values.
Overall, the visuals and statistical tests mentioned in this chapter are critical in diagnosing the problems in any linear regression model. It is easy to fit any model with powerful packages. However, the ease with which a model is fitted can result in sloppy models. Hence one must perform at least, let’s say, half a dozen diagnostic tests on the output of the model and document them along with the model results.
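Here is a minimal sketch of the four quantities for a one-predictor regression, in plain Python with made-up data. The formulas are the standard closed forms (leverage h, internally studentized r, externally studentized t and Cook's distance D); the function names are mine, and with p = 2 parameters the leverage has the simple form 1/n + (x - xbar)²/Sxx.

```python
def fit_simple(x, y):
    """Fit y ~ 1 + x; returns (intercept, slope, residuals)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return a, b, [yi - a - b * xi for xi, yi in zip(x, y)]


def diagnostics(x, y):
    n, p = len(x), 2
    _, _, resid = fit_simple(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sigma = (sum(e * e for e in resid) / (n - p)) ** 0.5
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx                 # leverage (hat value)
        r = e / (sigma * (1 - h) ** 0.5)                   # internally studentized
        t = r * ((n - p - 1) / (n - p - r * r)) ** 0.5     # externally studentized
        d = r * r * h / (p * (1 - h))                      # Cook's distance
        rows.append({"residual": e, "leverage": h, "internal": r,
                     "external": t, "cooks": d})
    return rows


# Hypothetical data: the last point sits far out in x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
for i, row in enumerate(diagnostics(x, y)):
    print(i, {k: round(v, 3) for k, v in row.items()})
```

The externally studentized residual is the "nifty calculation" at work: it gives the same number you would get by refitting the model without observation i and scaling the residual with the leave-one-out sigma, but without doing n separate regressions.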
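The added variable recipe can be sketched for the smallest interesting case, one predictor of interest and one "other" predictor. This is plain Python with made-up data; simple_resid and added_variable are names I chose, and handling only a single other predictor is a simplification so that no matrix algebra is needed.

```python
def simple_resid(x, y):
    """Residuals from regressing y on x (with an intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return [yi - a - b * xi for xi, yi in zip(x, y)]


def added_variable(y, xi, other):
    """Return the two residual series to plot and the slope through them."""
    ry = simple_resid(other, y)    # Y with the other predictor taken out
    rx = simple_resid(other, xi)   # Xi with the other predictor taken out
    slope = sum(u * v for u, v in zip(rx, ry)) / sum(u * u for u in rx)
    return rx, ry, slope


# Hypothetical data: y depends on both x1 and x2, plus a little noise
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [11.2, 7.9, 19.1, 17.8, 29.1]

rx, ry, slope = added_variable(y, x1, x2)
print("points to plot:", list(zip(rx, ry)))
print("slope through the added variable plot:", slope)
```

The slope of the line through the plotted points equals the coefficient of Xi in the full regression, which is why a single point that drags this line around is a point that is influential for that coefficient.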
Chapter 5 – Problems with Predictors
The chapter focuses on the problems that might arise with predictors in a linear regression. The first problem is measurement error in the independent variable. It is likely that observational data deviate from their true values. The problem with measurement error in an independent variable is that the estimate for that variable is biased towards zero. The section introduces the Simulation Extrapolation (SIMEX) method to deal with such a situation. The principle is very simple. You add random noise of different variances to the independent variable and, each time, estimate the coefficient for that variable in the regression equation. You then estimate the relationship between the coefficient and the error variance, and extrapolate it back to the case of no measurement error.
There is a section that talks about changing the scale of the variables so that the coefficients are sensible. The basic takeaway from this section is that under scaling, the R squared of the model and the p-value of the ANOVA test remain the same, while the beta and standard error of the scaled variable get scaled. The final aspect that the chapter talks about is collinearity. I remember vividly how difficult I found it to understand and explain this aspect to someone who had asked me to verbalize it. It was a pretty embarrassing situation for me. Back then, I tried to understand and verbalize the condition number and VIF but couldn’t. Now when I revisit this topic, it all looks pretty straightforward. The regression estimates involve inverting the cross product of the design matrix with itself. This inverse is stable only if none of the eigenvalues of the cross product is close to zero. Why? A near-zero eigenvalue means that the columns of the design matrix are almost linearly dependent, so the fit is poorly determined along the corresponding eigenvector. So the condition number is basically a thumb rule that compares the largest eigenvalue to each of the other eigenvalues and takes a call on whether there is collinearity. The other metric to check collinearity is the VIF, the variance inflation factor. In simple terms, all you need to do is regress a specific variable in the design matrix on the rest of the variables, find the R squared, and if it is high, it is likely that there is collinearity. Again, a straightforward explanation and computation. The R package associated with this book, “faraway”, has a function vif() that does the VIF calculations for all the variables in one go.
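A rough sketch of the SIMEX idea on simulated data, in plain Python. Everything here is an assumption made for illustration: the true slope of 2, the measurement error standard deviation sigma_u (which SIMEX needs to be known or estimated separately), the grid of extra-noise multipliers, and above all the straight-line extrapolation, which I picked for brevity; a quadratic extrapolant is the more common choice.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx


def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), reps=20, seed=0):
    """Add extra noise of variance lambda * sigma_u^2, then extrapolate to lambda = -1."""
    rng = random.Random(seed)
    path = [(0.0, slope(w, y))]                  # lambda = 0 is the naive fit
    for lam in lambdas:
        sd = (lam ** 0.5) * sigma_u
        sims = [slope([wi + rng.gauss(0, sd) for wi in w], y) for _ in range(reps)]
        path.append((lam, sum(sims) / reps))
    lams = [l for l, _ in path]
    betas = [b for _, b in path]
    g = slope(lams, betas)                       # trend of the slope in lambda
    intercept = sum(betas) / len(betas) - g * sum(lams) / len(lams)
    return intercept - g, path                   # value of the line at lambda = -1


# Simulated example: true slope 2, predictor observed with error of sd 1
rng = random.Random(1)
true_x = [rng.uniform(0, 10) for _ in range(1000)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in true_x]
w = [xi + rng.gauss(0, 1.0) for xi in true_x]

naive = slope(w, y)
corrected, path = simex_slope(w, y, sigma_u=1.0)
print("naive slope:", naive)
print("SIMEX slope:", corrected)
```

The naive slope is pulled towards zero, each extra dose of noise pulls it further, and reading the trend backwards to "minus one dose" undoes part of the attenuation. The correction is only as good as the extrapolant, so a straight line will typically under-correct.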
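The VIF recipe is easiest to see with just two predictors, where "regress one on the rest" collapses to a simple regression and the R squared is the squared correlation. A minimal sketch in plain Python; the function names and the data are made up, and VIF = 1 / (1 - R²) is the standard definition.

```python
def r_squared_simple(x, y):
    """R squared of y ~ 1 + x, i.e. the squared correlation of x and y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)


def vif_two_predictors(x1, x2):
    """With two predictors, both share the same VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared_simple(x1, x2))


# Orthogonal predictors: no inflation at all
print(vif_two_predictors([-1.0, 0.0, 1.0], [1.0, -2.0, 1.0]))

# Nearly collinear predictors: the variance of the estimates is inflated heavily
print(vif_two_predictors([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1]))
```

With more than two predictors the R squared comes from a multiple regression of each predictor on all the others, which is what vif() in the faraway package computes for every column in one go.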
Chapter 6 – Problems with Errors
The chapter starts off with a discussion about the possible misspecification of the error model in the linear regression framework. The errors need not be iid but might have a covariance structure amongst them. For such cases, one can use Generalized least squares. Another case is where the errors are independent but not identically distributed. For such cases, Weighted least squares can be used. The chapter talks about both these methods.
For GLS, you start off with a basic estimate of the covariance structure, fit a linear model and then recalculate the covariance structure. You do this a number of times till the covariance structure converges. An easier way is to use the nlme package, which does this automatically. For WLS, one can use the weights option in the lm function to specify the weights that need to be used for the regression.
The chapter then discusses various methods to test for lack of fit of a model. One obvious way to improve the fit is to add higher order terms. But this is always associated with higher uncertainty in the estimates. The section discusses a replication-based regression check which, as the author himself admits in the end, might not always be feasible.
Outlier detection using the leave-one-out principle might not work in cases where there are many outliers. In such cases, where the errors are not normal, robust regression is one of the methods that one can use. The methods covered under robust regression are least absolute deviation regression, Huber’s method and least trimmed squares regression. Huber’s method and least trimmed squares are available in the MASS package (rlm and ltsreg), and least absolute deviation regression in quantreg (rq). Robust regression methods give estimates and standard errors only. They do not provide an F value and p-value for the overall model. Least trimmed squares regression is the most appealing as it can filter out large residuals. The downside to least trimmed squares regression is the time taken for the regression. So as an alternative the author suggests simulation and bootstrapping methods. The section also gives an example where the least trimmed squares method outperforms the Huber estimate and ordinary regression.
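For a fixed covariance structure, both estimators have a closed form. The notation below is the usual textbook one and is my assumption, with Var(ε) = σ²Σ for GLS and positive weights w_i for WLS.

```latex
\begin{align*}
\text{GLS:} \quad
  & \hat\beta = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y,
  \qquad \operatorname{Var}(\hat\beta) = \sigma^2 (X^\top \Sigma^{-1} X)^{-1} \\
\text{WLS:} \quad
  & \Sigma = \operatorname{diag}(1/w_1, \dots, 1/w_n),
  \qquad \hat\beta = \arg\min_\beta \sum_{i=1}^{n} w_i \,(y_i - x_i^\top \beta)^2
\end{align*}
```

WLS is simply GLS with a diagonal Σ. The iteration described above is needed only because Σ is usually unknown and has to be re-estimated from the residuals of the previous fit.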
Chapter 7 – Transformation
There are ways to transform the predictors or the dependent variable so that the error terms are normal. It all depends on whether the error term enters multiplicatively or additively. For example, a log transformation of the Y variable makes the error enter in a multiplicative manner on the original scale, and hence the interpretation of the beta coefficients is also different. If you fit a linear model to log Y, then the coefficients are interpreted as multiplicative in nature, i.e. for a unit change in a predictor, the fitted value changes by a multiplicative factor. A log transformation is something that one can check to begin with, as it can remove the non-linearity in the relationship and make it linear. Yvonne Bishop is credited with the development of log-linear models.
There is also a popular transform called the Box-Cox transform, which picks the transformation for which the log likelihood is maximum. There is a function called boxcox in the MASS package that does this automatically. The Box-Cox transformation is defined for a specific lambda, so you have to choose the best lambda. The boxcox function in MASS gives a graph of lambda values against their log likelihoods. Looking at the graph, one can choose a specific lambda, apply the transformation and fit a linear model. Besides the Box-Cox transformation, there are other types of transformations that one can look at, such as the logit and Fisher’s Z transformation.
The chapter has a section on hockey-stick regression, where separate lines are fitted for various clusters of data. The way to go about it in R is simple. You code two separate functions for these clusters and use them in the regular lm framework via the I() function. lm() throws out the coefficients of the functions. There is also a section on polynomial regression, orthogonal polynomial regression and response surface models.
The last section of this chapter talks about B-splines, an interesting way to build smoothness and local effects into a regression model. In the previous section, hockey-stick regression was used to cater to local clusters/local behavior. Polynomial regression was used to make the regression fit smooth. However, the problem with polynomial regression is that it does not take local behavior into account, and the problem with hockey-stick regression is that it is not smooth (straight-line segments with a kink at the join, basically). A good mix of both features, i.e. smoothness and local effects, is brought out by B-splines. The concept is simple. You decide on a set of knots, fit a cubic polynomial between the knots and join the pieces smoothly.
The chapter ends with a reference to a non-parametric model, whose form is as follows:
I think locfit is a good package that can be used to fit the above non-parametric model. Clive Loader has written a fantastic book on local regression. I will blog about it someday. I am finding it difficult to strike a balance between the books I am yet to read and the books I am yet to summarize!
For me, the takeaway from this chapter is the boxcox function available in MASS, which one can use to get a sense of a suitable transformation. Before applying any transformation, it is advisable to look at at least 4 plots: the internally studentized residual plot, the externally studentized residual plot, the hat value plot and the Cook’s distance plot. Sometimes, just by removing the unusual observations, you can get a decent linear regression fit without any transformation whatsoever.
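What the boxcox graph plots can be sketched in a few lines: for each lambda on a grid, transform Y, fit the regression, and compute the profile log likelihood. This is plain Python with made-up data and my own function names; the formula is the standard profile log likelihood with its Jacobian term, and the grid from -2 to 2 in steps of 0.1 is just a choice.

```python
import math


def rss_simple(x, y):
    """RSS of the model y ~ 1 + x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def boxcox_loglik(x, y, lam):
    """Profile log likelihood: -(n/2) log(RSS_lam / n) + (lam - 1) * sum(log y)."""
    n = len(y)
    if abs(lam) < 1e-12:
        g = [math.log(v) for v in y]               # lambda = 0 is the log transform
    else:
        g = [(v ** lam - 1) / lam for v in y]
    return -0.5 * n * math.log(rss_simple(x, g) / n) + (lam - 1) * sum(math.log(v) for v in y)


# Hypothetical data: y grows exponentially in x, with a small multiplicative wobble
x = list(range(1, 11))
y = [math.exp(0.5 * xi) * (1 + 0.01 * (-1) ** xi) for xi in x]

grid = [k / 10 for k in range(-20, 21)]
best = max(grid, key=lambda lam: boxcox_loglik(x, y, lam))
print("lambda with the highest log likelihood:", best)
```

Because the data were built to be linear on the log scale, the likelihood peaks at lambda = 0. On real data one would read a round value (0, 0.5, 1 and so on) off the confidence band in the plot rather than take the exact maximizer.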
Chapter 8 – Variable Selection
There are two main types of variable selection. The stepwise testing approach compares successive models while the criterion approach attempts to find the model that optimizes some measure of goodness.
The chapter starts off by discussing hierarchical models. When selecting variables it is important to respect the hierarchy: lower order terms should not be removed before the higher order terms. Just pause for a minute here and think about the reason. Why shouldn't Y = c + aX^2 + error be used on its own, without the linear term in X? I knew intuitively that it was incorrect to go with higher order terms without the lower order terms being represented in the model, but I never spent time on understanding the reason. The book explains it using a change of scale: if X is replaced by X + k, a model like Y = c + aX^2 turns into a model of the form Y = c' + bX + dX^2, so the linear term reappears. A change of scale in measuring a predictor variable should not change your linear model. That's the reason why one must never consider higher order terms without considering lower order terms. Thanks to this book, I am now able to verbalize the reason. As an aside, I got confused on reading the content under the heading "Hierarchical model". At least to me, hierarchical models are Bayesian in nature. Somehow the author seems to be using this word loosely in this context. Or maybe it is OK to use this kind of terminology.
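The change-of-scale argument can be written out in a couple of lines. The shift k and the re-measured predictor Z are my own notation, introduced only for illustration:

```latex
Y = c + aX^2 + \varepsilon, \qquad X = Z + k
\;\Longrightarrow\;
Y = c + a(Z + k)^2 + \varepsilon
  = (c + ak^2) + 2ak\,Z + aZ^2 + \varepsilon
```

So in terms of Z the model has a linear term with coefficient 2ak, which vanishes only when k = 0. A model that omits the linear term therefore depends on where the origin of the predictor happens to be.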
Three test-based procedures are discussed for selecting variables: backward regression, forward regression and stepwise regression. The logic behind the procedures is obvious from their names. Backward regression means starting with all the predictors and removing one at a time, each time dropping the predictor with the largest p-value above a chosen cutoff. Forward regression means starting with a blank slate, adding the variable with the smallest p-value, and continuing the process till there are no more variables worth adding. Stepwise is a mix of forward and backward. Even though these methods are easy to follow, the author cautions that they might not be completely reliable. The chapter asks the reader to keep the following points in mind:
Because of the “one-at-a-time” nature of adding/dropping variables, it is possible to miss the "optimal" model.
The p-values used should not be treated too literally. There is so much multiple testing occurring that the validity is dubious. The removal of less significant predictors tends to increase the significance of the remaining predictors. This effect leads one to overstate the importance of the remaining predictors.
The procedures are not directly linked to final objectives of prediction or explanation and so may not really help solve the problem of interest. With any variable selection method, it is important to keep in mind that model selection cannot be divorced from the underlying purpose of the investigation. Variable selection tends to amplify the statistical significance of the variables that stay in the model. Variables that are dropped can still be correlated with the response. It would be wrong to say that these variables are unrelated to the response; it is just that they provide no additional explanatory effect beyond those variables already included in the model.
Stepwise variable selection tends to pick models that are smaller than desirable for prediction purposes. To give a simple example, consider the simple regression with just one predictor variable. Suppose that the slope for this predictor is not quite statistically significant. We might not have enough evidence to say that it is related to y but it still might be better to use it for predictive purposes.
The last section in this chapter is titled "Criterion-based procedures". The basic funda behind these procedures is as follows: you fit all possible models and select the ones that do best on a specific criterion. The criterion could be AIC, BIC, Mallows' Cp or adjusted R squared. I try to keep these four criteria in my working memory always. Why? I think I should at least know a few basic stat measures to select models. Before thinking of fancy models, I want my working memory to be on firm ground as far as simple linear models are concerned.
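As a memory aid for the four criteria, here is a minimal sketch using their usual textbook forms for a linear model (AIC and BIC are given up to an additive constant). It is written in Python rather than R purely for illustration, and the sample size, sums of squares and candidate models are hypothetical numbers, not results from the book.

```python
import math


def adjusted_r2(rss, tss, n, p):
    """Adjusted R^2 for a model with p parameters (intercept included)."""
    return 1.0 - (rss / (n - p)) / (tss / (n - 1))


def aic(rss, n, p):
    """AIC for a linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * p


def bic(rss, n, p):
    """BIC for a linear model, up to an additive constant."""
    return n * math.log(rss / n) + p * math.log(n)


def mallows_cp(rss, sigma2_full, n, p):
    """Mallows' Cp, with sigma2_full estimated from the full model."""
    return rss / sigma2_full + 2 * p - n


# Hypothetical candidates: name -> (number of parameters, residual sum of squares)
n, tss = 50, 200.0
candidates = {"small": (2, 80.0), "medium": (4, 60.0), "full": (6, 58.0)}
p_full, rss_full = candidates["full"]
sigma2_full = rss_full / (n - p_full)

scores = {
    name: {
        "adj_r2": adjusted_r2(rss, tss, n, p),
        "aic": aic(rss, n, p),
        "bic": bic(rss, n, p),
        "cp": mallows_cp(rss, sigma2_full, n, p),
    }
    for name, (p, rss) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])

for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

Lower is better for AIC and BIC, higher is better for adjusted R squared, and for Cp one looks for small values close to or below p. By construction, the full model's Cp equals its number of parameters.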
The section kind of lays out a nice framework for going about simple linear modeling, i.e.
Use regsubsets to get a list of candidate models relating the predictors to the dependent variable
Compute Mallows' Cp and adjusted R squared for all the models
Select a model based on the above criteria
Use the four horsemen of Linear Model Diagnostics (my own terminology just to remember things) and remove any unusual observations
Rerun regsubsets
As one can see, it is an iterative procedure, as it should be.
The author concludes the chapter by recommending criteria based selection methods over automatic selection methods.
Chapter 9 – Shrinkage Methods
The chapter is titled "Shrinkage Methods" for a specific reason, i.e. the methods used in this chapter help shrink the beta estimates in a linear regression framework. Typically, in datasets where there are a large number of variables, modeling involves selecting the right variables and then regressing on that optimal set of variables. How does one decide the optimal number of variables? If there are, let's say, 100 variables, and you proceed by either backward/forward/stepwise regression or criterion-based regression, the parameter estimates are going to be unstable. This chapter talks about three methods that can be used when there is a large number of predictors. The underlying principle of the first two is the same, i.e. find some combinations of the predictor variables and reduce the dimension so that the linear regression framework can be applied on this reduced space.
The first method mentioned in this chapter is PCA regression. You do a PCA on the data, select a few dimensions that capture the majority of the variance, and do a regression with those components. The idea here is dimension reduction coupled with a possible explanation of the PCA factors. There is not always a good explanation of the PCA factors. How to choose the number of factors? The scree plot is the standard tool. However, I learnt something new in this chapter: a cross-validation based selection of the number of factors. There is a package called pls that has good functions to help a data analyst combine PCA based regression and cross-validation.
The second method mentioned in this chapter is Partial Least Squares. The essence of the method is to find various linear combinations of predictors and use that reduced dimension space to come up with a robust model. The problem with this kind of model is that you end up building a model with predictors that are extremely tough to explain to others. The chapter ends with the third method, ridge regression.
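To see what "shrinking the beta estimates" means in the simplest possible setting, here is a sketch of ridge regression with a single predictor and an unpenalized intercept, where the slope has the closed form Sxy / (Sxx + lambda). The data and the penalty values are made up for illustration, and the code is Python rather than the book's R.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one predictor with an unpenalized intercept.

    Minimizes sum((y - a - b*x)^2) + lam * b^2, which gives
    b = Sxy / (Sxx + lam); lam = 0 recovers ordinary least squares.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + lam)


# Made-up data lying exactly on y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
slopes = [ridge_slope(x, y, lam) for lam in (0.0, 10.0, 100.0)]
print(slopes)
```

With no penalty the slope is the least squares value of 2. As the penalty grows, the estimate is pulled towards zero, which is the shrinkage the chapter title refers to.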
Chapter 10 – Statistical Strategy and Model Uncertainty
This chapter talks about one of the most important aspects in data modeling, i.e. model uncertainty. The previous chapters of the book covered three broad areas:
Diagnostics – Checking of assumptions-constant variance, linearity, normality , outliers, influential points , serial correlation and collinearity
Transformations – Boxcox transformations, splines and polynomial regression
Variable Selection – Testing and Criterion based methods
There is no hard and fast rule for following a specific sequence of steps. Sometimes diagnostics might be the better first step, sometimes a transformation could be the right step to begin with. I mean the order in which the three steps, i.e. Diagnostics, Transformations and Variable Selection, need to be followed is entirely dependent on the problem and the data analyst.
The author cites an interesting situation, i.e. asking his class of 28 students to estimate a linear model given the data. Obviously the author/professor knew the true model. Looking at each of the 28 models submitted by his students, he says that there was nothing inherently wrong in the way each student applied the above three steps. However, the models that were submitted were vastly different from the true model. So were their root mean square errors and other model fit criteria.
The author cites this situation to highlight "Model Uncertainty". Methods for realistic inference when the data are used to select the model have come to be known under the name "Model Uncertainty". The effects of model uncertainty often overshadow the parametric uncertainty, and the standard errors need to be inflated to reflect this. This is probably one of the reasons why one should use different approaches to modeling and select those models that he/she feels confident about. It is very likely that a dozen modelers would come up with a dozen different models. I guess for an individual data modeler, the best way is to approach the problem and frame it in various ways. Maybe use Bayesian modeling, use a local regression framework, use MLE, etc. Build as many alternate models as possible and choose the predictors on which the majority of models concur, so it is more like a Law of Large Models.
I stumbled on an interesting paper by Freedman, titled “Some Issues in the Foundation of Statistics”, that talks about a common issue facing Bayesian and Frequentist modelers, i.e Model Validation. The paper is written extremely well and highlights the basic problem facing any modeler, “How to trust the model that one develops?”
Chapter 11 – Insurance Redlining – A case study
This case study touches on almost all the important concepts discussed in the previous 10 chapters. If a reader wants to know whether it is worth reading this book, I guess the best way is to look at the dataset mentioned in this chapter, work for about 10-15 min on the data and check your analysis against the way the author has gone about it. If you find the author's way far superior and more elegant than yours, then it might be worth spending time on the book. In my case, the time and effort invested in this book has been pretty rewarding.
Chapter 12 – Missing Data
What do you do when there is missing data? If the number of cases with missing values is small compared to the size of the data, you can safely remove them from modeling. However, if your dataset is small or the missing data is crucial, then one needs to do missing data treatment. This chapter talks about a few methods like the mean fill-in method and the regression fill-in method. It also points to the EM algorithm, which is more sophisticated and applicable to practical problems. My perspective towards missing data treatment changed when I was told that there are PhDs employed in some hedge funds whose main job is to do missing data treatment for security data. So, depending on your interest, curiosity and understanding levels, you can actually earn your living by doing missing data analysis!
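The two fill-in methods are simple enough to sketch in a few lines. This is a toy Python illustration with made-up data and hypothetical function names, not the book's R code. For the regression fill-in I use a single fully observed variable to predict the missing values of another; using None to mark a missing value is also my own convention.

```python
def mean_fill(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def regression_fill(x, y):
    """Fill missing y values from a least squares line of y on x.

    The line is fitted on the complete (x, y) pairs only; x is assumed
    to be fully observed.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    x_bar = sum(xi for xi, _ in pairs) / n
    y_bar = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


# Made-up data: y = 2x, with one value missing
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, None, 6.0, 8.0]
filled_mean = mean_fill(y)
filled_reg = regression_fill(x, y)
print(filled_mean)
print(filled_reg)
```

The mean fill-in ignores the relationship with x and plugs in the average of the observed values, whereas the regression fill-in uses that relationship and recovers a value on the fitted line.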
Chapter 13 – Analysis of Covariance
This chapter talks about the case when there are qualitative predictors in the regression framework, basically the factor variables in R's jargon. One can code the factor variable and use the same concepts, functions and packages that are discussed in the previous chapters. The chapter discusses two examples: the first has a predictor that takes only two levels and the second has a predictor with more than 2 levels. The same combination of "Diagnostics-Transformation-Variable Selection" is used in a specific order to come up with a reasonable model.
Chapter 14 – One way Analysis of Variance
ANOVA is the name given to a linear modeling situation where the predictor is a factor variable with multiple levels. One can use the same linear model framework for this case too. One thing to remember is that when one is comparing pairwise differences, Tukey's Honest Significant Difference (HSD) test is a better way to test than individual pairwise comparison tests.
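A quick back-of-the-envelope calculation shows why individual pairwise tests are a problem. The sketch below (Python, hypothetical function names) counts the pairwise comparisons for a factor with k levels and computes the chance of at least one false positive if each test is run at the 5% level. It assumes the tests are independent, which pairwise comparisons are not, so treat the number as a rough indication only.

```python
def n_pairwise(k):
    """Number of pairwise comparisons among k factor levels."""
    return k * (k - 1) // 2


def familywise_error(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


k = 5
m = n_pairwise(k)
fwe = familywise_error(0.05, m)
print(m, round(fwe, 3))
```

With 5 levels there are 10 comparisons, and under the independence assumption the chance of at least one spurious "significant" difference is about 40%, far above the nominal 5%. Tukey's procedure is designed to control this overall error rate.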
The last two chapters of the book talk about factorial designs and block designs. At least for now, I don't see myself using the fundas from these chapters in my work. So, I have speed-read them to get a basic overview of the principles involved.
The good thing about this book is that it comes with an R package that has all the datasets and several other functions used in the examples. Also, the author makes use of about half a dozen packages. I have kept track of the packages used by the author and here is the list of R packages used in the book:
This book has one central focus throughout all the chapters: equip the reader to understand the nuts and bolts of regression modeling in R. The book explains in detail all aspects of linear modeling in R. Hence this book is highly appealing to someone who already knows the math behind linear models and wants to apply those principles to real life data using R.
July 22, 2012
Posted by safeisrisky under Books
As recently as 1997, the financial markets comprised blue chip stocks traded by specialists at NYSE, other stocks traded at NASDAQ by specialists, and a small scale electronic system. Fast forward to 2012, and the US market comprises 40 trading destinations. There are four public exchanges: NYSE, NASDAQ, Direct Edge and BATS. Inside each of these exchanges there are various destinations. NYSE has NYSE Arca, NYSE Amex, NYSE Euronext and NYSE Alternext, NASDAQ has three markets, and BATS and Direct Edge have two market destinations each. There are toxic dark pools. There are internalizers, the Citadels and Knight Tradings of the world, that execute trades within their own trading pools. The system, as you can see, has become extremely complex. Dark pools and internalizers accounted for 40% of all trading volume in 2012. The pace of developments has been unbelievable. How did this all come about? Why has the average holding time of a stock gone down from 8 months in 2000 to 2 months in 2008, and finally dropped to 22 seconds in 2011? This book tries to answer some of these questions. This book is not so much about dark pools as it is about tracing the personalities behind some important firms in the modern high speed trading world.
The market structure in the US has completely changed in the last decade. The following visual summarizes the various firms that came into existence and the way they were gobbled up by 2007.
The book starts off with Mathisson of Credit Suisse ECN addressing a set of HFT traders and algo traders who trade in the lit pools. He sounds a big warning to the audience that ECNs are being populated with AI driven trading bots that are all feeding on big orders from mutual funds and on retail orders. He says that this aspect is fundamentally going to alter the trading behavior of many. Rightly so: sensing that trade bots are ripping them off, the big institutions, the buy side asset managers, are increasingly turning to dark pools where there is anonymity and less chance of bots attacking the order.
Using this speech as a prop, the author then poses various questions on the kind of mess that the financial system has become, with fragmented trading destinations, hunter-seeker algos populating the lit markets, dark pools etc.
The book profiles Haim Bodek, a math grad turned quant trader. After putting in about a decade of experience (1997-2007) at Hull, at Goldman and at UBS, he starts a venture, Trading Machines, under the assumption that he has enough skills to beat the market with all the fundas acquired. All is well in the first two years, when his firm rakes in good money. At one point in time, the firm averages 17,000 stock trades and 6,500 option trades per day. However by Dec 2009, his strategies and the firm suddenly start losing money. He thinks it is an internal bug in the system and tries to fix it.
Despite hours of toil, he never manages to fix it. In the end, he realizes from a chance conversation with an exchange official that the bug is external to the system. Strange order types were being given preference at the exchanges and were thus making his trades sitting ducks. Complex order types always managed to stay at the top of the book and hence were able to profit from the spreads. Trading Machines was still in the old limit-order, market-order world and all its algos were designed for such a world. In the past, Bodek had made money precisely by staying ahead of everyone in the order book, using strategies like the "Size Game" that involved putting in big market/limit orders so that they always managed to go ahead of everyone in the queue. Sadly, this time around he was eaten by other firms who were getting ahead of his trades in the queue. He hadn't noticed that the world had changed rapidly, with exchanges sleeping with the enemy (HFT firms).
This story did make me pause: a guy who had spent years designing and trading strategies to get ahead of everyone in the order book was somehow ignorant of these complex order types that were being given preference over limit and market orders. If he was following the developments so closely, how did he miss this new development? One obvious answer, I guess, is that the market had become increasingly complex and he couldn't keep pace with the developments.
He also stumbles upon a paper on the 0+ strategy that was being employed by HFT firms even before he had started his company. These firms were using strategies that looked for opportunities where the probability of a plus 1 tick move was higher than that of a minus 1 tick move, and were pumping in huge volumes to capture all such opportunities. Basically they created orders which were kind of hidden in the order book and were untouched by Reg NMS. They fed on limit order trades. They sensed a whale in the market, immediately ran past the whale, bought the relevant stock, jacked up the price, turned around and sold it to the whale. This was just one of the umpteen strategies that were being employed by HFT firms. Bodek tries to save his firm after understanding the situation. However, with the options market adopting the maker-taker model too, he is forced to shut his firm. The story of Bodek goes to show that order book based strategies have become increasingly complex and volatile in returns. You cannot be sure of any strategy.
The author traces the decade-long, mind-boggling pace of developments back to a few people, whom he calls "the plumbers of modern financial world". The people whom the author covers in depth are the following:
Joshua Levine, a college dropout, joins Russo Securities as a runner. Soon his boss at Russo pulls him over to join Datek Securities. At Datek, Levine finds that traders are using NASDAQ's SOES system to beat specialists at their own game. Subsequently Levine builds a lot of tools for the traders. While specialists at NASDAQ tried to eat the spread from the orders generated by retail and institutional clients, Datek traders exploited the loopholes in the system to get ahead of the queue and extract a fraction of the spread that the specialists were making. These fractions, multiplied by a huge volume of trades, made Datek traders immensely rich. All the traders kind of knew that a big reason for their success was the toolkit that Levine built. These tools spanned the entire gamut of activities in the daily life of a trader, i.e. from bid-ask quote streaming to back end assignment of trades. These tools also helped Datek launch an entire business catering to day traders.
One thing that especially bothered Levine was that the traders at Datek were often on opposite sides of a trade and yet they were going to NASDAQ to get their orders executed. He figured out that if he had a system that could match the orders internally, the traders need not go to specialists and would have lower transaction costs. Thus was born the idea of "Island", an order matching system outside the confines of NASDAQ. Its super fast and super cheap execution attracted a lot of order flow and by 1996, almost 50% of the order flow to NASDAQ was from Island. Since Island's USP was speed, it was a ripe playground for many a trading startup to use technology to make money. Levine's dream of eliminating the power of specialists/middlemen was being fulfilled by Island. Renaissance, Getco, Tradebot and many successful outfits that we know today were big users of Island in the 1990s.
Another of Levine's strategies that paid off was the "maker-taker" model for the Island crowd. By agreeing to pay traders who provided liquidity and to take fees from traders who demanded liquidity, Island created massive liquidity as it attracted HFT traders. The irony here is that Levine started Island to remove specialist middlemen, but actually ended up creating a new type of middleman, the high frequency trader. The HFT guys, on the pretext of providing liquidity, were actually making the cost of trading higher for retail and institutional clients. They were playing the rebate game. Island later merged with Instinet, creating one of the biggest trading venues. Subsequently Levine left Island as he could never fit into the big firm culture of the merged entity.
Jerry Putnam, a failed brokerage firm owner, was trying to make headway out of his professional crisis. A chance encounter with a founder of a trading-software firm gave him his first exposure to electronic trading. He wants to create an outfit for day traders in Chicago similar to Datek in NY. He starts a venture focusing on applying strategies on SOES to make money. He recruits day traders and improves the UI of the trading systems to create a faster and easier interface for day traders. The external regulatory environment that led to the Order Handling Rules and the ECN creation rules gives him the idea of creating an ECN out of Chicago. Soon, he realizes that there are already faster, more powerful players in the market. So he hits upon a counterintuitive and ingenious idea, i.e. to route the orders to other pools. Instead of being market centric, Jerry Putnam chose to be execution centric. In hindsight it was a brilliant move that finally made his firm, Archipelago (a name chosen to connote linked islands), a highly successful venture. It was later gobbled up by NYSE, thus marking a shift in the NYSE culture where it squarely acknowledged that electronic trading is the future.
Dave Cummings, a Chicago pit trader, decides that there must be a better way to make markets and starts Tradebot, an automated market making machine. His idea is scoffed at in umpteen places before he meets the GETCO founders. GETCO immediately latches on to the idea and partners with Tradebot. With ECNs exploding and the SEC's order handling rules helping HFT traders, Tradebot and GETCO grow rapidly. Within a span of two years, they account for 10% of all trading in NASDAQ stocks. Tradebot and GETCO part ways as they fail to agree on the business strategy. Dave Cummings does a lot of things to improve speed of execution, like colocation, which subsequently became a standard approach for any HFT player. Later, after a wave of mergers and acquisitions in the ECN space, Dave Cummings gets worried that he would end up paying more fees to the resulting behemoths, i.e. Island-Instinet-NASDAQ and NYSE-Archipelago. He then starts BATS in Nov 2008 and it becomes a huge hit with the traders.
All the above stories are trying to say that the idea and the work done by Joshua Levine in creating Island had so many repercussions that within a span of 10 years the whole trading landscape changed.
The penultimate part of the book covers the various events in the last few years that have led many people to voice concerns about hyper speed trading and fragile markets. The events, narrated more in a story format, are Goldman programmer Sergey Aleynikov being accused of stealing code, the Flash Crash, Senator Kaufman's fight against toxic trading, Arnuk and Saluzzi's efforts to create awareness amongst the general public, etc.
The book ends with a series of events happening around the world that seem to be making the trading venues, not only in the US but throughout the world, far more complex and terrifying. More and more people are turning to machines for trading, be it for generating signals for arb or for making markets.
The author concludes the book by quoting the story of a startup that uses a completely automated machine for its fund management. The fact that the book ends with a machine beating the market in 2010 and 2011 leaves the reader wondering about the future of trading. What is the author finally trying to say? Is "increasing automated trading" a good sign or a bad sign? Well, I guess the answer depends on which part of the world you are in and what your vested interests are. I think the countries that have recently adopted DMA or algo trading have the advantage of learning from the mistakes that have happened in the developed markets. Or do they? Aren't "Greed" and "Fear" universal emotions, and is any system designed by a human ever immune to them? In any case one thing is for sure: the recent events have made a whole lot of people really skeptical about the utility of ECNs and HFTs. Are we eliminating human middlemen (specialists) only to substitute them with computerized middlemen (bots)?
This book is not so much about dark pools as the title suggests. The book is primarily about the founders of Island, Archipelago, Tradebot, GETCO and other such firms that have had a tremendous impact on the US financial system. All the stories are peppered with just enough numbers, just enough anecdotes and just enough quirky facts that the book sustained my interest till the very end.